Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 859 - AnandTech community forums.

Gideon

Golden Member
Nov 27, 2007
1,842
4,379
136
It does have similar latency in cycles, but worse absolute latency [ns].
Yeah, that's true. I wish we had a (relatively) apples-to-apples latency comparison between the M4 @ 4.4 GHz and Zen 5 to see what the actual latencies are. The only info chipsandcheese has up is quite out of date, 7950X vs M1 (from here):


Roughly 5.4 ns for the M1 vs 2.4 ns for the 7950X. So yeah, a very significant difference of over 2x.

But the M1 only clocked up to 3.2 GHz, while the M4 clocks up to 4.4 GHz. I'm more than certain Apple relaxed the L2 latency by a few cycles to get there, but I'd still very much like to see where they ended up (and I hope reviewers measure it).

The AMD L2 is extremely tight; you are not increasing its size at all without a latency regression. You are absolutely not sharing it with anything without a latency regression.

That's true, it's not possible without some latency regression. My whole point was that with ever-more-prevalent 3D cache there is a growing gap between the rather small 1MB L2 and the gigantic 96MB L3.

Take techpowerup reviews as an example (as they use the same mobo and ram configs):

For the Ryzen 9700X review they registered 7.7 ns L3 latency
For the Ryzen 7700X it was 9.9 ns L3 latency
For the Ryzen 7800X3D it was 12.7 ns L3 latency, for a 3x bigger cache

A <30% regression for 3x the size. Looks to be a pretty decent tradeoff (and I expect it to be less for 9800X3D as it clocks higher!).

But then again, the latency gap between L2 and L3 went from 3x to 4x.
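Quick sanity check of those numbers in Python (the ~3 ns private-L2 figure is my assumption, in the same ballpark as the 7950X figure quoted earlier, not a TechPowerUp measurement):

```python
# TechPowerUp L3 latency figures quoted above, in nanoseconds.
l3_ns = {"9700X": 7.7, "7700X": 9.9, "7800X3D": 12.7}

# V-cache regression vs. the plain 7700X: <30% for 3x the capacity.
regression = (l3_ns["7800X3D"] / l3_ns["7700X"] - 1) * 100
print(f"7800X3D vs 7700X: +{regression:.0f}% L3 latency")  # ~+28%

# L2 -> L3 latency gap, assuming ~3 ns for the private L2.
l2_ns = 3.0
for cpu in ("9700X", "7800X3D"):
    print(f"{cpu}: L3 is {l3_ns[cpu] / l2_ns:.1f}x the L2 latency")
```

With those assumed inputs the gap works out to roughly 2.6x on the 9700X and over 4x on the 7800X3D, which is where the "3x to 4x" observation comes from.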

As many consumer applications are heavily cache/memory bound, there seems to be performance there, waiting to be extracted.

What options are there to do that?

1. Adding extra cache layers - possible, but numerous other significant drawbacks
2. Upsizing the private L2 to 2MB or 3MB (as Intel did) - this is the easiest solution, but even with "just" 2MB of L2, we use 16MB of the CCD's SRAM budget on L2, while limiting the amount a single thread can use to 2 MB. Going beyond that (3MB for 24MB total) seems insanely wasteful to me.
3. sharing the L2 between cores - a much more complex solution with obvious latency regressions as you stated

Extrapolating from what AMD did with the L3, it should be possible to go from 1MB to 3MB with a ~30% latency increase (3 ns -> 4 ns). Actually, I think AMD would do better, as AFAIK when going from 512KB to 1MB AMD managed to regress much less than that!

TL;DR: So a private 2MB L2 is indeed the most obvious solution to address this.

It's just that in my La La Land, I'd like to see a shared L2 solution where the banks next to the core have almost no latency regression and the ones further away regress by 20-30%, but allow a core to use up to 8MB of L2 instead of "just" 2MB.

The intriguing alternative is to keep the L2 at 2MB on the base SKU and take the "2-3 cycle hit" on 3D-cache parts by also doubling the private L2 to 4MB on the V-cache parts (keeping the relative latency between L2 and L3 the same).
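As a toy model of that shared-bank idea (all numbers invented for illustration: a ~3 ns local bank and a 25% penalty, the midpoint of the 20-30% guess, for the three remote 2 MB banks):

```python
# Hypothetical NUCA-style shared L2: 4 cores x 2 MB banks, 8 MB reachable.
local_ns = 3.0          # assumed Zen-like private-L2 latency
remote_penalty = 0.25   # assumed penalty for a non-local bank
banks = [local_ns] + [local_ns * (1 + remote_penalty)] * 3

# Capacity-weighted average if accesses spread evenly over all 8 MB:
avg = sum(banks) / len(banks)
print(f"average latency over 8 MB: {avg:.2f} ns vs {local_ns} ns local")
```

Even with accesses spread evenly (a pessimistic assumption, since hot data would tend to sit in the local bank), the average only rises to ~3.6 ns while the reachable capacity quadruples.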
 
Last edited:

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
If Zen continues to get wider and slower, then I can see them using the reduced clock-speed targets to double the size of the L2 while keeping the same number of cycles of latency. That should help with throughput a bit.

The next iteration of the cache die will likely be on TSMC N4C, as SRAM scaling falls off a cliff after N5 and won't really recover another increment of shrinking until BSPD and GAA get applied to it, which is going to be a few years after those things are in volume in the core dies, as they will be very expensive processes on a per-wafer basis. N4C should allow a bit of a shrink, especially for a cache-targeted chip design, while allowing better performance/power curves. If the CCD stays roughly its current size, then they could likely fit more cache on the cache die, so a 50-100% increase wouldn't be unreasonable. It's possible that they may decouple it into an L4 cache to allow the first 32MB of L3 to have a lower latency, at the expense of a few extra cycles of RAM latency.
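That L3/L4 split trade-off can be sketched with a toy average-memory-access-time model (every hit rate and latency below is invented for illustration, not a measurement of any real part):

```python
# Toy AMAT comparison: one big, slower L3 vs. a fast 32 MB L3 plus an L4.
def amat(levels, mem_ns):
    """levels = [(hit_rate, latency_ns), ...] from nearest to farthest;
    each hit rate is conditional on missing all nearer levels."""
    t, miss = 0.0, 1.0
    for hit, lat in levels:
        t += miss * hit * lat
        miss *= (1 - hit)
    return t + miss * mem_ns

unified = amat([(0.60, 12.0)], 80.0)               # 96 MB unified L3
split   = amat([(0.45, 9.0), (0.25, 16.0)], 83.0)  # 32 MB L3 + L4, +RAM cycles
print(f"unified: {unified:.1f} ns, split: {split:.1f} ns")
```

With these made-up numbers the two come out within a couple of ns of each other, which is the point: whether the split wins depends entirely on how much of the working set fits in the faster 32 MB.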
 
Reactions: Tlh97

Mopetar

Diamond Member
Jan 31, 2011
8,149
6,861
136
BTW, if it is true and the L3 die is below, then why not make the SRAM amount > 64 MB? There would be room for more on the die.

The increase in hit time may not be worth the added capacity for most apps. There are already a lot that don't gain anything from the V-cache, and increasing the latency hurts performance for all of those.

So a 5.2 GHz max boost is pretty much confirmed. If the thermal constraint was lifted, why is boost still 0.3 GHz down from the non-V-cache model?

It may still be voltage constrained, limiting the clock speed.

Another possibility is binning/market segmentation. If you want the faster boost you'll have to shell out for a 9900X3D or a 9950X3D.
 

Josh128

Senior member
Oct 14, 2022
511
865
106
9800X3D Blender Open Data entry. 11% faster than the 9700X. OC maybe? Or maybe it's due to its 120W TDP vs the original 65W TDP of the 9700X. It's massively faster than the 7800X3D.




 

StefanR5R

Elite Member
Dec 10, 2016
6,056
9,106
136
Roughly 5.4ns for M1 [= 12 MB shared L2$] vs 2.4ns for 7950X [= 1 MB private L2$].
Or 3.8 ns for Telum [= 32 MB private L2$¹]
But that's when neither die area nor power consumption are of immediate concern.²

________
¹) of which parts can be dynamically repurposed into shared virtual L3$ (12 ns on average) or shared virtual L4$ even (which is off-chip cache).
²) almost a square inch of 7nm Samsung silicon for an 8-core chiplet, with 200 W power budget — but this is a real and honest way to obtain a ticket to La La Land. ;-)
 

Gideon

Golden Member
Nov 27, 2007
1,842
4,379
136
It's possible that they may decouple it into an L4 cache to allow the first 32MB of L3 to have a lower latency, at the expense of a few extra cycles of RAM latency.
In client (where you can probably only afford one design, with modifications, for both mobile and desktop) I'd much rather have an SLC instead of an L4.

Each layer of cache adds extra complications, more tags to keep track of, etc.

That's one of the reasons Apple and Qualcomm forgo an L3. Unless you can afford to make the L3 big enough (say 24GB - 32GB+) you might be better off with bigger shared L2 caches and an SLC, which also benefits the GPU, NPU ...
 
Reactions: Tlh97

inquiss

Senior member
Oct 13, 2010
250
354
136
In client (where you can probably only afford design with modification for both mobile and desktop) I'd much rather have an SLC instead of L4.

Each layer of cache adds extra complications, more tags to keep track of, etc.

That's one of the reasons Apple and Qualcomm forego L3. Unless you can afford to make the L3 big enough (say 24GB - 32GB+) you might be better off with bigger shared L2 caches and a SLC, that also benefits the GPU, NPU ...
That's a giant L3 you're not gonna see in a long time...
 

Gideon

Golden Member
Nov 27, 2007
1,842
4,379
136
That's a giant L3 you're not gonna see in a long time...
Yeah, that's why you're not gonna see an L3 on Qualcomm / Apple SoCs.

At least as long as they're 90% mobile-focused. Apple's rumored server SKUs might actually have an L3, and it might trickle down to higher-end desktop / M Max SKUs.
 

inquiss

Senior member
Oct 13, 2010
250
354
136
Yeah, that's why you're not gonna see L3 on qualcomm / apple SoCs.

At least until they are 90% mobile focused. Apple's rumored server SKUs might actually have L3 and it might trickle down to higher end desktop / M Max SKUs
Well, yeah, that and how big gigabyte-sized L3s would be. It's absurd.
 

yuri69

Senior member
Jul 16, 2013
574
1,017
136
If Zen continues to get wider and slower, than I can see them using the reduced clockspeed targets to double the size of the L2 while keeping the same number of cycles of latency. That should help with throughput a bit.
From my layman PoV, investing in a bigger L2 does not yield much in terms of general-purpose IPC. Zen 4 doubled Zen 3's relatively small 512kB L2, yet it trailed the rest of the "major IPC contributors" at sub-2% IPC points. Intel has gone down the same odd L2-growing route since Willow Cove: 512kB -> 1.25MB -> 2MB -> 2.5/3MB.
 

Det0x

Golden Member
Sep 11, 2014
1,346
4,545
136
9800X3D Blender Open Data entry. 11% faster than 9700X. OC maybe? Or maybe its due to its 120W TDP vs the original 65W TDP of 9700X. Its massively faster than 7800X3D.


A more optimized V/F curve.
X3D has always had a better one, but it has been capped by temperature and voltage limits in the past (look at the 7950X3D vs the 7950X at lower PPT limits (sub-160W) and compare the efficiency).

Now that Z5X3D is unhindered by temperature and nearly all voltage limits are lifted, the new V/F curve finally gets its time to shine... You will see the Z5X3D models beat vanilla Zen 5 in pretty much all MT workloads @ stock PPT limits.
The only remaining place where regular Zen 5 wins is in light ST workloads, since the V-cache can't handle much more than 5.7 GHz (when the silicon is pushed to the limit).
 
Last edited:

SteinFG

Senior member
Dec 29, 2021
684
804
106
If it's priced higher than $450 it's getting negative reviews for sure.
The 7800X3D was selling in large numbers at $350 just a year or so after its launch. I think AMD can just price it at $450 and give smaller discounts later; it's better long-term.
But it's AMD, they like to miss.
 
Reactions: Josh128