Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 690

LightningZ71

Golden Member
Mar 10, 2017
1,777
2,134
136
Thinking about the Zen5c CCX on Strix Point, that's the least amount of L3 cache per thread since Lucienne. Yes, the L2 is twice as large and still exclusive, so the apparent L2/L3 cache is 33% larger, but Zen5c, even restricted to ~3.6 GHz, is much higher performance than Zen2 at up to 4-4.5 GHz. This becomes an even bigger issue with AVX-512 code, as that can quickly become memory-throughput bound.

All of that is to say that the Zen5c CCX is very memory bandwidth starved and will be hitting the memory controller and snooping the other CCX a lot. This seems like a very hamstrung design unless the memory controller has gobs of bandwidth available at low latency. A MALL cache would have made this make sense. The current design just exacerbates existing problems with limited bandwidth.

This leads me to believe that AMD changed course hard after critical aspects of Strix Point's design were already frozen, such as the rumored NPU expansion that removed the MALL. They wanted to hit a marketing goal of having a processor that looked like Alder Lake/Raptor Lake, but things went sideways along the way and they didn't course-correct well.
This also shows up in the iGPU. Adding an additional WGP and increasing the clocks makes sense if you are anticipating lots of bandwidth. It's barely higher than Hawk, and iGPU performance, especially for non-compute-bound tasks, is already turning out to be only modestly better.

Had they planned for this originally, it would have made more sense to axe the extra WGP and instead double the L2 to 4MB while keeping the clock bump. Total floor plan area would be broadly similar, but there would need to be a bit of rearranging. This change would also have helped mitigate some of the memory contention issues the processor faces. What it would have negatively impacted is AI workloads, as there would be less compute available for them, though, with less thermal load generated, the APU would be able to maintain higher iGPU clocks in thermally limited laptops.

That MALL cache change, if real, really impacted this whole thing.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,034
136
I'm amazed that this comment got jumped on (-7 downvotes?).

I made that comment in the context of a few people discussing how some of Zen5's architectural changes seemed questionable (which could also be said of Bulldozer).

I don't think they're all that questionable (aside from perhaps the CCX layout, but I've had issues with heterogeneous multicore for a while.) Zen5 is a performance and efficiency improvement over Zen4 with no major shrink. It's fine. It's a better uarch than Zen4. Folks just got a little over-enthusiastic, egged on by some of the usual suspects, in the run-up to release.
 

Malachijtjfjf

Member
Oct 9, 2022
26
42
51
But this also shows that ZEN5 needs more juice compared to ZEN4. The advantage at low wattages is pretty meh for 50% more threads. It only starts getting decent at power levels above ZEN4's sweet spot. I'm 80% sure we will see that every ZEN5 desktop SKU is slower than its predecessor at low wattages. Similar to Igor's leaks, the ZEN5 ES was beaten by the 7950X pretty much everywhere below 100W. ZEN5 needs juice to run properly.
Not really, notebooktech is just measuring total system power, not the package power, during ST workloads.
 

Attachments

  • IMG_2721.png
    335.3 KB · Views: 32

Abwx

Lifer
Apr 2, 2011
11,516
4,302
136

GTracing

Member
Aug 6, 2021
78
192
76
What? That is TOO MUCH IMO.
What a waste.
How small does it have to be to be acceptable? If it's too small then the NPU wouldn't be useful for anything, and then it would really be a waste. I understand being skeptical of Microsoft's 40 TOPS mandate, but 10% of the die area is not that bad.

NPUs currently have a chicken and the egg problem where the software isn't going to be developed until there's hardware to run it on, and chip designers don't want to add hardware that doesn't offer any benefit right now. Microsoft is trying to jumpstart things by requiring a baseline NPU. It offers very little benefit right now, but in 5 years software support for NPUs will be far more mature than it would otherwise be.
 

Jan Olšan

Senior member
Jan 12, 2017
396
680
136
So how would the 2nd decoder be used? It would decode the instruction immediately after the one being decoded by the first decoder?

On taken branches (technically speaking, when the predictor predicts that a branch is going to be taken), the code jumps to a known address. That is where the second decoder cluster can immediately start working in parallel with the first decoder cluster.

In the future, they may add some predecode that would add instruction-start marking into the L1 cache or into fetch or something, which could enhance this capability to be used more often. We don't know yet.

as I understand from the Tremont discussion, the second decoder should decode the second path of a branch

I don't think that is how it works; it doesn't seem to decode both ways of a branch, which this would amount to. That approach would likely harm power efficiency.
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
Ian said that Zen 5c would see a much smaller drop in frequencies compared to what Z4c had relative to the classic cores.
Interesting.
[...] AMD said this during the post-tech-day briefing in the context of Strix Point's N4P Zen 5 core and N4P Zen 5c core sizes, the latter being just 25% smaller than the former. (Whereas the difference was 35% between Phoenix 2's Zen 4 and Zen 4c cores.) These lowered space savings are because AMD had to work within stricter constraints WRT voltages and clocks in order to make the compact cores work as a support to the classic cores the way AMD wanted them to. [...]
If I am looking at this right, they did manage to compact the core logic by quite some degree but, as is to be expected, the L2$ and L3$ diminish the overall area savings a lot. (The 8 MB L3$ of the Zen 5c CCX looks even less dense than the 16 MB L3$ of the Zen 5 CCX, but this is surely because of twice the number of cores to connect in the former.) The f_max which Strix Point's compact cores are designed for is probably OK for the limited TDPs this part is targeted at, and higher TDPs will really need to be covered by Fire Range...
 

Geddagod

Golden Member
Dec 28, 2021
1,295
1,368
106
No. AMD said this during the post-tech-day briefing in the context of Strix Point's N4P Zen 5 core and N4P Zen 5c core sizes, the latter being just 25% smaller than the former. (Whereas the difference was 35% between Phoenix 2's Zen 4 and Zen 4c cores.) These lowered space savings are because AMD had to work within stricter constraints WRT voltages and clocks in order to make the compact cores work as a support to the classic cores the way AMD wanted them to.
(Source: Computerbase's article on this briefing. Also see Tom's Hardware's interview with Mike Clark in which the Zen 5c design targets are discussed in the context of heterogeneous CPUs, read: Strix Point. PS, I haven't watched the Cutress X Cozma video.)
Didn't see that, nice catch. Sounds like AMD wanted the dense cores to be more efficient at higher frequencies due to the fact that they are in the mainstream mobile lineup now.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
On taken branches (technically speaking, when the predictor predicts that a branch is going to be taken), the code jumps to a known address. That is where the second decoder cluster can immediately start working in parallel with the first decoder cluster.

In the future, they may add some predecode that would add instruction-start marking into the L1 cache or into fetch or something, which could enhance this capability to be used more often. We don't know yet.

That's a working approach for a CPU that doesn't have a MOP cache, like Intel's E-cores. But with a quite large MOP cache, branches should mainly hit the OP cache once they have previously been executed and decoded. I can't see much decoding improvement from targeting such a branch with the second decoder on MOP-cache CPUs.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,034
136
That's a working approach for a CPU that doesn't have a MOP cache, like Intel's E-cores. But with a quite large MOP cache, branches should mainly hit the OP cache once they have previously been executed and decoded. I can't see much decoding improvement from targeting such a branch with the second decoder on MOP-cache CPUs.

It's more about fetch than decode (but decode is a nice bonus.)

It means that you can potentially deliver uops from a second taken branch immediately without waiting to redirect fetch. That is useful. Not groundbreaking necessarily, but useful.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
It's more about fetch than decode (but decode is a nice bonus.)

It means that you can potentially deliver uops from a second taken branch immediately without waiting to redirect fetch. That is useful. Not groundbreaking necessarily, but useful.

I thought that fetching uops has to happen in program order. The second taken branch can fetch instruction data and decode it, but uop fetch has to happen in program order, so decoded instructions from the two decoders are combined in program order before being fetched.
 

gdansk

Platinum Member
Feb 8, 2011
2,833
4,210
136
I have seen multiple people ask why not decode both paths of a branch, but I think even the simplest loop that executes, say, a million times shows why that is a bad idea.

The not-taken path pays off once, and you've wasted resources the other 999,999 times.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
I have seen multiple people ask why not decode both paths of a branch, but I think even the simplest loop that executes, say, a million times shows why that is a bad idea.

We don't know how AMD has implemented their dual decoding. They might very well decode both paths in cases where they don't accurately know which branch will be taken. In that case, max decode width is still 4 for a single thread; the performance increase comes from situations where prediction accuracy is low and they can recover from a misprediction faster.
 

gdansk

Platinum Member
Feb 8, 2011
2,833
4,210
136
We don't know how AMD has implemented their dual decoding. They might very well decode both paths in cases where they don't accurately know which branch will be taken. In that case, max decode width is still 4 for a single thread; the performance increase comes from situations where prediction accuracy is low and they can recover from a misprediction faster.
Yes, if they are decoding both paths, it has to be with a very specific prediction history. In the general case, pursuing both branches is a bad idea.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,034
136
I thought that fetching uops has to happen in program order. The second taken branch can fetch instruction data and decode it, but uop fetch has to happen in program order, so decoded instructions from the two decoders are combined in program order before being fetched.

I said deliver immediately, not that they could be intermixed (though I'd be curious whether there's any mechanism to do so, with tagging by PC; it's certainly not impossible and would allow early execution of uops after the second branch.)

But also, trivial scenarios like this stand to benefit, because you can start fetch/decode of doStuff in the same cycle as you predict someLikelyCondition:

if (someLikelyCondition)
    if (someOtherLikelyCondition)
        doStuff;

Or, another example:

if (someLikelyCondition)
    doStuff;
doMoreStuff; // as soon as this is dispatched, uops from doEvenMoreStuff can start to be dispatched, as they've been fetched/decoded in parallel
if (someOtherLikelyCondition)
    doEvenMoreStuff; // no bubble for fetch/decode - its fetch/decode started at the same time as doStuff's did
 

Timmah!

Golden Member
Jul 24, 2010
1,510
824
136

Getting this result on an air cooler is impressive. 13th/14th gen needs an AIO
If this were out-of-the-box performance, or at the very least something achievable by a simple action like enabling PBO (one that would not in turn decrease single-core clocks and performance, as was the case with the 7000 series), not requiring meddling around with curves, voltages and whatnot, then I guess it would be something to consider after all. How much do you reckon one could sell a 7950X for?
 

desrever

Member
Nov 6, 2021
166
444
106
Power/perf scaling according to NBC.

So this means the H370x is more than double the performance of the 155H at 15W, and probably almost double the 185H too. Good luck to Lunar Lake catching up with its advertised 50% improvement.

Really funny that people here are acting like these aren't the best mobile CPUs outside of Apple.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,759
1,455
136
to save area

More CCXes -> more logic, more Infinity Fabric, more cache, more area.

Maybe it makes the layout fit together more nicely without needing to refactor other structures to pack things in more compactly (AMD has in the past taken the L on area rather than refactoring).
 