Discussion: Zen 5 Architecture & Technical Discussion


StefanR5R

Elite Member
Dec 10, 2016
Lion Cove/Skymont analysis at David Huang's Blog: there are interesting comparisons to Zen 5.
The one thing that does not sit right with me is the efficiency of the cores in the dense cluster; less efficiency than the vanilla cores is weird.
The graph does not show core efficiency. It is SIR rate-1 performance over package power. The operating system, and maybe the test regime, may have kept too many of the other components in the package from powering down deeply enough for single-core power usage to become the dominant part of package power consumption.

The other graph, which is based on VDDCR data from MMIO read requests to the SMU (in which case the act of observing the data unfortunately influences the data), shows the classic core and the dense core closer to each other WRT SIR r1 perf/W.
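
(For anyone who wants to check this at home: on Linux, Zen exposes RAPL-style energy counters that let you separate per-core energy from package energy. A minimal sketch follows; the MSR addresses are the ones used by the amd_energy/zenpower drivers, so treat them as an assumption and verify against the PPR for your part.)

Code:
/* Minimal sketch (Linux, root, "msr" kernel module loaded): read the
 * RAPL-style energy MSRs on Zen so per-core energy can be separated from
 * whole-package energy. MSR addresses are the ones used by the
 * amd_energy/zenpower drivers; verify them against the PPR for your part. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_PWR_UNIT    0xC0010299  /* energy status unit (bits 12:8) */
#define MSR_CORE_ENERGY 0xC001029A  /* per-core accumulated energy    */
#define MSR_PKG_ENERGY  0xC001029B  /* package accumulated energy     */

static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t v = 0;
    pread(fd, &v, sizeof v, reg);   /* offset selects the MSR number */
    return v;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* One energy count = 1 / 2^ESU joules. */
    double unit = 1.0 / (double)(1ULL << ((rdmsr(fd, MSR_PWR_UNIT) >> 8) & 0x1F));

    uint64_t c0 = rdmsr(fd, MSR_CORE_ENERGY), p0 = rdmsr(fd, MSR_PKG_ENERGY);
    sleep(1);
    uint64_t c1 = rdmsr(fd, MSR_CORE_ENERGY), p1 = rdmsr(fd, MSR_PKG_ENERGY);

    /* The counters are 32-bit; the cast handles wraparound. */
    printf("core 0: %.3f W   package: %.3f W\n",
           (uint32_t)(c1 - c0) * unit, (uint32_t)(p1 - p0) * unit);
    return 0;
}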
 

coercitiv

Diamond Member
Jan 24, 2014
The graph does not show core efficiency. It is SIR rate-1 performance over package power. The operating system, and maybe the test regime, may have kept too many of the other components in the package from powering down deeply enough for single-core power usage to become the dominant part of package power consumption.

The other graph, which is based on VDDCR data from MMIO read requests to the SMU (in which case the act of observing the data unfortunately influences the data), shows the classic core and the dense core closer to each other WRT SIR r1 perf/W.
I'm aware of both issues; that's why I said that for now all we can do is take it for what it's worth: observe the strange result and raise an eyebrow.
 

Geddagod

Golden Member
Dec 28, 2021
TL;DR - cores dangerously close to each other, collisions expected.

The one thing that does not sit right with me is the efficiency of the cores in the dense cluster; less efficiency than the vanilla cores is weird.
He noticed the same thing with Zen 4 vs Zen 4 Dense, IIRC, though the gap there seemed much smaller than the one seen here.
I would imagine the fact that the Zen 5C cluster has less L3 cache would play a role here too?
20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB6 ST, so that's only apparently close. In the real world they are far apart, about two gens apart. And to think that we had people here expecting SKT to match or even beat Zen 4; I once said that it was at Zen 3 level at best.
Pretty sure Skymont in LNL doesn't even have access to the L3, much like the LPE cores in MTL. They do have the SLC, but still...
Wait till Skymont appears in ARL.
 

coercitiv

Diamond Member
Jan 24, 2014
I would imagine the fact that the Zen 5C cluster has less L3 cache would play a role here too?
The workload plays a role (affinity for cache) and the very limited method of measurement plays another (probably bigger) role. Nevertheless, I think we're more informed about the subject than we were before Huang published the data, as long as we don't attempt to draw definitive conclusions.
 

StefanR5R

Elite Member
Dec 10, 2016
I'm aware of both issues; that's why I said that for now all we can do is take it for what it's worth: observe the strange result and raise an eyebrow.
Admittedly I didn't read your other post very carefully, as I was already preoccupied with typing my post... :-)

Phoenix 2 was later to market than Genoa + Bergamo. According to low-resolution die shots (Phoenix 2/ Durango/ Vindhya), the Zen 4-classic and Zen 4-dense core layouts in Phoenix 2 on 4nm still look the same as Genoa's and Bergamo's on 5nm. That is, all the optimization work which went into Bergamo relative to Genoa, in order to fit 33% more fully featured cores into the same socket power envelope, should be reflected in Phoenix 2.

Strix Point, on the other hand, is ready for market much earlier than Turin-dense, and the latter is designed for a newer manufacturing node. (Plus, Strix Point's Zen 5 and Zen 5c incarnations are somewhat cut down from Turin's and Turin-dense's, though basically just in the FP department.) This is not the speculation thread, but I can't help wondering whether power-efficiency tweaking of Strix Point's dense cores was somewhat cut short for the sake of time to market, and due to being a 3nm/4nm co-design to some extent.
 

DavidC1

Golden Member
Dec 29, 2023
20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB6 ST, so that's only apparently close. In the real world they are far apart, about two gens apart. And to think that we had people here expecting SKT to match or even beat Zen 4; I once said that it was at Zen 3 level at best.
Skymont in Lunarlake performs 41% better per clock in both SPECint and Geekbench Integer (assuming 90% clock scaling), while in FP it performs 60% faster in GB5 and 70% faster in GB6 compared to Crestmont LP, which is almost exactly what Intel claims.
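
To make the normalization explicit: a per-clock comparison is the score ratio divided by the sustained-clock ratio. A minimal sketch with placeholder numbers (only the 90% clock-scaling assumption is from the comparison above; the scores and clocks below are purely illustrative):

Code:
#include <stdio.h>

/* Sketch of the per-clock normalization behind claims like "41% better
 * per clock". All scores and clocks are made-up placeholders; only the
 * formula and the 90% clock-scaling assumption are the point. */
int main(void)
{
    double score_new = 1269.0, nominal_clock_new = 3.8; /* hypothetical */
    double score_old = 1000.0, clock_old         = 3.8; /* hypothetical */

    /* "90% clock scaling": assume the new core only sustains 90% of its
     * nominal clock during the benchmark. */
    double sustained_new = nominal_clock_new * 0.90;

    double per_clock_gain =
        (score_new / sustained_new) / (score_old / clock_old) - 1.0;

    printf("per-clock uplift: %.1f%%\n", per_clock_gain * 100.0); /* ~41% */
    return 0;
}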

Now we're waiting for the ring version, which Intel said will outperform Raptorlake by 2% in SPEC.

Even though Lunarlake's version is not that performant, it scales down to extremely low power and is very power efficient.
No access to L3, and based on info from the lead SoC architect who worked on the project, the SLC is there more as a bandwidth-saving solution.
This would make sense, as higher-performing caches involve tradeoffs such as larger cell sizes and/or lower power efficiency due to higher leakage and outright higher power use.
 
Jul 27, 2020
I think a special game mode could be developed for the dual-chiplet Ryzens, using a combination of Windows and BIOS trickery, where the lower-binned CCD essentially acts as a zombie CCD for the preferred CCD. The zombie CCD's threads don't do any actual work, but they help prefetch data for the preferred CCD's threads and provide that data when needed, acting as a dedicated virtual cache CCD, using all its available L2+L3 capacity for this purpose and reducing the instances of waiting on data from system RAM. How hard would this be?
 

StefanR5R

Elite Member
Dec 10, 2016
L3$ is CCX-private and L2$ is core-private. Prefetching into a certain core's L2$ does nothing for a different core but steal memory access bandwidth (and, while doing so, increase memory access latency). I am not sure whether prefetching into one CCX's L3$ would help with the latency of the other CCX's L3$ misses, but in any case, all sorts of more aggressive prefetch policies run the risk of regressing due to memory access bandwidth and latency limitations.
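
To illustrate why: software prefetch hints are executed by a core and fill that core's caches only, so a "helper" thread on the other CCD cannot pre-warm anything for the game's threads. A minimal sketch of what a prefetch hint actually does (function name and prefetch distance are made up):

Code:
#include <immintrin.h>
#include <stddef.h>

/* Software prefetch is core-local: the hint fills the L1/L2 of the core
 * executing this loop (and at best the L3 of *its own* CCX). A thread on
 * the other CCD gains nothing from these hints except losing the memory
 * bandwidth they consume. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 64;   /* elements ahead; tune per workload */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            _mm_prefetch((const char *)&a[i + dist], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}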
 

Tuna-Fish

Golden Member
Mar 4, 2011

I think the vector side is annotated wrong. The unit is clearly split into 4 quarters, with the execution units tightly bound to the register files. Instead of having 256-bit units per quarter, I think each quarter is a 128-bit slice across the entire SIMD pipe. This would make the maximum communication distance from registers to units much shorter.
 

naukkis

Senior member
Jun 5, 2002
I think the vector side is annotated wrong. The unit is clearly split into 4 quarters, with the execution units tightly bound to the register files. Instead of having 256-bit units per quarter, I think each quarter is a 128-bit slice across the entire SIMD pipe. This would make the maximum communication distance from registers to units much shorter.

You can't slice AVX-512 instructions efficiently. To be able to do fast 512-bit permutations, the register file needs to hold the whole 512-bit register. So those are 256-bit quarters, and the adjacent ones are combined for executing 512-bit instructions (or there are just two double-pumped pipelines). And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one-cycle latency involved.
 

MS_AT

Senior member
Jul 15, 2024
And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one-cycle latency involved.
Any data to back this up? The optimization manual doesn't talk about it, and it's not included in any latency tables that I know of. And the way you state it, it sounds different from the one additional latency cycle due to schedulers being full. [That applies only to instructions that would themselves have only 1 cycle of latency; anything with longer latency is not affected.]
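
If someone wanted to test it, the usual approach is a serially dependent chain of 1-cycle vector ops: if some scheduling patterns pay a cluster-crossing cycle, cycles-per-op creeps above 1. A rough sketch of that methodology only (no pinning, no fixed clocks, and rdtsc counts reference cycles rather than core cycles, so take the raw output with a grain of salt):

Code:
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

/* Rough sketch: a serially dependent chain of VPADDQ. Every add consumes
 * its own previous result, so cycles/iteration approximates the effective
 * latency, including any extra forwarding cost if dependent ops land in
 * different register file clusters. A real test needs core pinning, fixed
 * clocks and many runs. */
int main(void)
{
    __m128i x = _mm_set1_epi64x(1);
    const int iters = 100000000;

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
        x = _mm_add_epi64(x, x);            /* dependent chain */
        __asm__ __volatile__("" : "+x"(x)); /* keep the compiler from folding it */
    }
    uint64_t t1 = __rdtsc();

    printf("~%.2f cycles per dependent add (result %lld)\n",
           (double)(t1 - t0) / iters, (long long)_mm_cvtsi128_si64(x));
    return 0;
}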
 

Bigos

Member
Jun 2, 2019
Another explanation would be the register file being duplicated to both the top and the bottom parts. This is usually done to increase the number of read ports of a register file, at the cost of area and power. Writes to the register file have to access both parts, so the number of write ports does not increase in this scheme.

And with a complex enough forwarding network, the latency of register file writes does not matter much, as it can be delayed: subsequent operations can source their operands either directly from the back end of the execution units or from the forwarding network's "buffers".

This might still mean that, on the critical path, an operation done on the bottom EU needs to be forwarded to the top EU immediately (or vice versa), all within the same cycle. This might be easier to do with the EUs, as they are closer to each other than the register files are (and the middle area is probably mostly the forwarding network itself).
 

Tuna-Fish

Golden Member
Mar 4, 2011
You can't slice AVX-512 instructions efficiently. To be able to do fast 512-bit permutations, the register file needs to hold the whole 512-bit register. So those are 256-bit quarters, and the adjacent ones are combined for executing 512-bit instructions (or there are just two double-pumped pipelines). And there are those two different register file clusters, so when dependent instructions are scheduled to different clusters there's an additional one-cycle latency involved.

That's if you try to make wide permutes very low latency. I find AMD's optimization manual for Zen 5 (zip, contains a PDF and a spreadsheet with latencies) very illuminating here.

Notably:

All forms of VPERMW emit the same number of Mops and have the same high throughput of 2 per cycle. However...
VPERMW x, x, x (that is, 128-bit registers) has a latency of 2 cycles.
VPERMW y, y, y (256-bit regs) has a latency of 4 cycles.
VPERMW z, z, z (512-bit regs) has a latency of 5 cycles.

This to me screams a horizontally split implementation. And it's also, at least to me, the technically better way to build it, because it means you only pay for lane-crossing when you actually intentionally cross lanes.
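
For reference, the three variants in that table map to these intrinsics (AVX-512BW, plus AVX-512VL for the narrower forms); a tiny compile-and-run sketch to make the width progression concrete:

Code:
#include <immintrin.h>
#include <stdio.h>

/* The three VPERMW variants from the latency table, expressed as
 * intrinsics. Same operation at three widths; per AMD's Zen 5 tables the
 * latency grows 2 -> 4 -> 5 cycles as more lane-crossing becomes possible.
 * Build with -mavx512bw -mavx512vl. */
int main(void)
{
    __m128i idx1 = _mm_set1_epi16(3),    a1 = _mm_set1_epi16(7);
    __m256i idx2 = _mm256_set1_epi16(3), a2 = _mm256_set1_epi16(7);
    __m512i idx3 = _mm512_set1_epi16(3), a3 = _mm512_set1_epi16(7);

    __m128i r1 = _mm_permutexvar_epi16(idx1, a1);    /* VPERMW x,x,x: 2c */
    __m256i r2 = _mm256_permutexvar_epi16(idx2, a2); /* VPERMW y,y,y: 4c */
    __m512i r3 = _mm512_permutexvar_epi16(idx3, a3); /* VPERMW z,z,z: 5c */

    printf("%d %d %d\n",
           _mm_extract_epi16(r1, 0),
           _mm256_extract_epi16(r2, 0),
           _mm512_cvtsi512_si32(r3) & 0xFFFF);
    return 0;
}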
 