It does have similar latency in cycles, but worse absolute latency [ns].
Yeah, that's true. I wish we had a (relatively) apples-to-apples latency comparison between M4 @ 4.4 GHz and Zen 5 to see what the actual latencies are. The only info that chipsandcheese has up is the quite out-of-date 7950X vs M1 comparison (from here):
Roughly 5.4ns for M1 vs 2.4ns for 7950X. So yeah, a very significant difference of over 2x.
But M1 clocked only up to 3.2 GHz, while M4 clocks up to 4.4 GHz. I'm more than certain Apple relaxed the L2 latency by a few cycles to get there, but I'd still very much like to see where they ended up (and I hope reviewers measure it).
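To put the cycles-vs-nanoseconds point in numbers, here is a rough sketch. It assumes the 5.4ns M1 figure was measured at the full 3.2 GHz clock, and the "+3 cycles" relaxation for M4 is purely a made-up placeholder, not a measured value.

```python
def ns_to_cycles(latency_ns, clock_ghz):
    """Latency in core cycles: ns * (cycles per ns)."""
    return latency_ns * clock_ghz


def cycles_to_ns(cycles, clock_ghz):
    """Absolute latency in ns for a given cycle count and clock."""
    return cycles / clock_ghz


# M1 figure from the post; assumes it was measured at 3.2 GHz.
m1_cycles = ns_to_cycles(5.4, 3.2)

# If M4 kept the same cycle count at 4.4 GHz:
m4_same_cycles_ns = cycles_to_ns(m1_cycles, 4.4)

# Hypothetical: Apple relaxes the L2 by 3 cycles (placeholder value).
m4_relaxed_ns = cycles_to_ns(m1_cycles + 3, 4.4)

print(f"M1 L2: ~{m1_cycles:.1f} cycles")
print(f"M4, same cycles: ~{m4_same_cycles_ns:.2f} ns")
print(f"M4, +3 cycles (assumed): ~{m4_relaxed_ns:.2f} ns")
```

Even with a few cycles added, the higher clock would pull the absolute latency down from the M1 number, which is why a fresh measurement would be interesting.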
The AMD L2 is extremely tight, you are not increasing its size at all without a latency regression. You are absolutely not sharing it with anything without a latency regression.
That's true, it's not possible without some latency regression. My whole point was that with the ever-more-prevalent 3D cache there is a growing gap between the rather small 1MB L2 and the gigantic 96MB L3.
Take techpowerup reviews as an example (as they use the same mobo and ram configs):
For the Ryzen 9700X review they registered 7.7ns L3 latency
For the Ryzen 7700X it was 9.9ns L3 latency
For the Ryzen 7800X3D it was 12.7ns L3 latency for a 3x bigger cache
A <30% regression for 3x the size. Looks to be a pretty decent tradeoff (and I expect it to be less for 9800X3D as it clocks higher!).
But then again, the latency gap between L2 and L3 went from 3x to 4x.
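A quick sanity check of those ratios, as a sketch. The L3 numbers are the techpowerup figures quoted above; the ~3ns L2 latency is an assumption taken from the rough estimate used later in this post, not a measured value.

```python
# L3 latencies (ns) quoted from the techpowerup reviews above.
l3_ns = {"9700X": 7.7, "7700X": 9.9, "7800X3D": 12.7}

# Assumed L2 latency (ns); rough figure, not from the reviews.
L2_NS = 3.0

# Regression of the 3x larger 3D-cache L3 relative to the 7700X.
l3_regression = l3_ns["7800X3D"] / l3_ns["7700X"] - 1

# How many times slower L3 is than L2 on each part.
l3_to_l2_gap = {cpu: ns / L2_NS for cpu, ns in l3_ns.items()}

print(f"7800X3D vs 7700X L3 regression: {l3_regression:.1%}")
for cpu, gap in l3_to_l2_gap.items():
    print(f"{cpu}: L3 is ~{gap:.1f}x the L2 latency")
```

With these inputs the regression comes out just under 30%, and the L3/L2 gap moves from roughly 3.3x on the 7700X to roughly 4.2x on the 7800X3D.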
As many consumer applications are heavily cache/memory bound, there seems to be performance there, waiting to be extracted.
What options are there to do that?
1. Adding extra cache layers - possible, but with numerous significant drawbacks of its own
2. Upsizing the private L2 to 2MB or 3MB (as Intel did) - this is the easiest solution, but even with "just" 2MB of L2, we use 16MB of the CCD's SRAM budget on L2, while limiting the amount a single thread can use to 2 MB. Going beyond that (3MB for 24MB total) seems insanely wasteful to me.
3. Sharing the L2 between cores - a much more complex solution with obvious latency regressions, as you stated
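The SRAM budget math behind option 2, as a small sketch. The 8 cores per CCD is an assumption implied by the 16MB and 24MB totals, and the 32MB base L3 is inferred from 96MB being "3x bigger".

```python
CORES_PER_CCD = 8   # assumption, implied by 2MB -> 16MB total
BASE_L3_MB = 32     # inferred: the 96MB 3D-cache L3 is 3x this


def private_l2_budget(l2_per_core_mb, cores=CORES_PER_CCD):
    """Total SRAM spent on private L2 across one CCD, in MB."""
    return l2_per_core_mb * cores


for size_mb in (1, 2, 3):
    total = private_l2_budget(size_mb)
    share = total / BASE_L3_MB
    print(f"{size_mb}MB/core: {total}MB total L2 "
          f"({share:.0%} of a {BASE_L3_MB}MB L3), "
          f"but one thread still only sees {size_mb}MB")
```

The point is that the total grows with core count while the capacity visible to any single thread stays at the per-core size.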
Extrapolating from what AMD did with L3, it should be possible to go from 1MB to 3MB with a 30% latency increase (3ns -> 4ns). Actually I think AMD would do better, as AFAIK going from 512KB to 1MB AMD managed to regress much less than that!
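As a back-of-the-envelope sketch of that extrapolation: it simply applies the same ~30% penalty for 3x capacity to an assumed ~3ns L2, then checks what that does to the L3/L2 gap on a 3D-cache part. Both the 3ns baseline and the idea that L2 scales like L3 are assumptions.

```python
L2_BASE_NS = 3.0      # assumed latency of the current 1MB L2
PENALTY_FOR_3X = 0.30 # assumed: same cost AMD paid for 3x the L3
L3_X3D_NS = 12.7      # 7800X3D L3 latency quoted above

# Estimated latency of a hypothetical 3MB private L2.
l2_3mb_ns = L2_BASE_NS * (1 + PENALTY_FOR_3X)

# L3/L2 latency gap before and after the larger L2.
gap_before = L3_X3D_NS / L2_BASE_NS
gap_after = L3_X3D_NS / l2_3mb_ns

print(f"3MB L2 estimate: ~{l2_3mb_ns:.1f} ns")
print(f"L3/L2 gap: {gap_before:.2f}x -> {gap_after:.2f}x")
```

Under these assumptions the gap on the 3D-cache part drops back to a bit over 3x, close to where the non-3D parts sit today.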
TL;DR: So a private 2MB L2 is indeed the most obvious solution to address this.
It's just that in my La La land, I'd like to see a shared L2 solution where the banks next to the core have almost no latency regression and the ones further away regress by 20-30%, but a core can use up to 8MB of L2 instead of "just" 2MB.
The intriguing alternative is to keep the L2 at 2MB on the base SKU and take the "2-3 cycle hit" on 3D cache parts by also doubling the private L2 on the V-cache die to 4MB (keeping the relative latency between L2 and L3 the same).