Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 690

LightningZ71

Golden Member
Mar 10, 2017
1,777
2,134
136
Thinking about the Zen5c CCX on Strix Point, that's the least amount of L3 cache per thread since Lucienne. Yes, the L2 is twice as large and still exclusive, so the apparent L2/L3 cache is 33% larger, but Zen5c, even restricted to ~3.6 GHz, is much higher performance than Zen2 at up to 4-4.5 GHz. This becomes an even bigger issue with AVX-512 code, as that can quickly become memory-throughput bound.

All of that is to say that the Zen5c CCX is very memory bandwidth starved and will be hitting the memory controller and snooping the other CCX a lot. This seems like a very hamstrung design unless the memory controller has gobs of bandwidth available at low latency. A MALL cache would have made this make sense. The current design just exacerbates existing problems with limited bandwidth.

This leads me to believe that AMD changed course hard after critical aspects of Strix Point's design were already frozen, such as the rumored NPU expansion that removed the MALL. They wanted to hit a marketing goal of having a processor that looked like Alder Lake/Raptor Lake, but things went sideways along the way and they didn't course-correct well.
This also shows up in the iGPU. Adding an additional WGP and increasing the clocks makes sense if you are anticipating lots of bandwidth. It's barely higher than Hawk, and iGPU performance, especially for non-compute-bound tasks, is already turning out to be only modestly better.

Had they planned for this originally, it would have made more sense to axe the extra WGP and instead double the L2 to 4MB while keeping the clock bump. Total floor plan area would be broadly similar, but there would need to be a bit of rearranging. This change would also have helped mitigate some of the memory contention issues the processor faces. What it would have negatively impacted is AI workloads, as there would be less compute available for them, though, with less thermal load generated, the APU would be able to maintain higher iGPU clocks in thermally limited laptops.

That MALL cache change, if real, really impacted this whole thing.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,034
136
I'm amazed that this comment got jumped on (-7 downvotes?).

I made that comment in the context of a few people discussing how some of Zen5's architectural changes seemed questionable (which could also be said of Bulldozer).

I don't think they're all that questionable (aside from perhaps the CCX layout, but I've had issues with heterogeneous multicore for a while.) Zen5 is a performance and efficiency improvement over Zen4 with no major shrink. It's fine. It's a better uarch than Zen4. Folks just got a little over-enthusiastic, egged on by some of the usual suspects, in the run-up to release.
 

Malachijtjfjf

Member
Oct 9, 2022
26
42
51
But this also shows that ZEN5 needs more juice compared to ZEN4. The advantage at low wattages is pretty meh for 50% more threads. It only starts getting decent at power levels above ZEN4's sweet spot. I'm 80% sure we will see that every ZEN5 desktop SKU is slower than its predecessor at low wattages. Similar to Igor's leaks, the ZEN5 ES was beaten by the 7950X pretty much everywhere below 100W. ZEN5 needs juice to run properly.
Not really, notebooktech is just measuring total system power, not the package power, during ST workloads.
 

Attachments

  • IMG_2721.png
    335.3 KB · Views: 32

Abwx

Lifer
Apr 2, 2011
11,516
4,302
136

GTracing

Member
Aug 6, 2021
78
192
76
What? That is TOO MUCH IMO.
What a waste.
How small does it have to be to be acceptable? If it's too small then the NPU wouldn't be useful for anything, and then it would really be a waste. I understand being skeptical of Microsoft's 40 TOPS mandate, but 10% of the die area is not that bad.

NPUs currently have a chicken and the egg problem where the software isn't going to be developed until there's hardware to run it on, and chip designers don't want to add hardware that doesn't offer any benefit right now. Microsoft is trying to jumpstart things by requiring a baseline NPU. It offers very little benefit right now, but in 5 years software support for NPUs will be far more mature than it would otherwise be.
 

Jan Olšan

Senior member
Jan 12, 2017
396
680
136
So how would the 2nd decoder be used? It would decode the instruction immediately after the one being decoded by the first decoder?

On taken branches (technically speaking, when the predictor predicts that a branch is going to be taken), the code jumps to a known address. That is where the second decoder cluster can immediately start working in parallel with the first decoder cluster.

In the future, they may add some predecode that would add instruction-start marking into the L1 cache or into fetch or something, which could enhance this capability to be used more often. We don't know yet.

as I understand from the Tremont discussion, the second decoder should decode the second path of a branch

I don't think that is how it works; it doesn't seem to decode both ways of a branch, which this would amount to. That approach would likely harm power efficiency.
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,747
136
Ian said that Zen 5c would see a much smaller drop in frequencies compared to what Z4c had relative to the classic cores.
Interesting.
[...] AMD said this during the post-tech-day briefing in the context of Strix Point's N4P Zen 5 core and N4P Zen 5c core sizes, the latter being just 25% smaller than the former. (Whereas the difference was 35% between Phoenix 2's Zen 4 and Zen 4c cores.) These lowered space savings are because AMD had to work within stricter constraints WRT voltages and clocks in order to make the compact cores work as a support to the classic cores the way AMD wanted them to. [...]
If I am looking at this right, they did manage to compact the core logic by quite some degree but, as is to be expected, the L2$ and L3$ diminish the overall area savings a lot. (The 8 MB L3$ of the Zen 5c CCX looks even less dense than the 16 MB L3$ of the Zen 5 CCX, but this is surely because of twice the number of cores to connect in the former.) The f_max which Strix Point's compact cores are designed for is probably OK for the limited TDPs this part is targeted at, and higher TDPs will really need to be covered by Fire Range...
 

Geddagod

Golden Member
Dec 28, 2021
1,295
1,368
106
No. AMD said this during the post-tech-day briefing in the context of Strix Point's N4P Zen 5 core and N4P Zen 5c core sizes, the latter being just 25% smaller than the former. (Whereas the difference was 35% between Phoenix 2's Zen 4 and Zen 4c cores.) These lowered space savings are because AMD had to work within stricter constraints WRT voltages and clocks in order to make the compact cores work as a support to the classic cores the way AMD wanted them to.
(Source: Computerbase's article on this briefing. Also see Tom's Hardware's interview with Mike Clark in which the Zen 5c design targets are discussed in the context of heterogeneous CPUs, read: Strix Point. PS, I haven't watched the Cutress X Cozma video.)
Didn't see that, nice catch. Sounds like AMD wanted the dense cores to be more efficient at higher frequencies due to the fact that they are in the mainstream mobile lineup now.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
On taken branches (technically speaking, when the predictor predicts that a branch is going to be taken), the code jumps to a known address. That is where the second decoder cluster can immediately start working in parallel with the first decoder cluster.

In the future, they may add some predecode that would add instruction-start marking into the L1 cache or into fetch or something, which could enhance this capability to be used more often. We don't know yet.

That's a working approach for a CPU that doesn't have a MOP cache, like Intel's E-cores. But with a quite large MOP cache, branches should mainly hit the OP cache once they have previously been executed and decoded. I can't see much decoding improvement from targeting such a branch with the second decoder on MOP-cache CPUs.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,034
136
That's a working approach for a CPU that doesn't have a MOP cache, like Intel's E-cores. But with a quite large MOP cache, branches should mainly hit the OP cache once they have previously been executed and decoded. I can't see much decoding improvement from targeting such a branch with the second decoder on MOP-cache CPUs.

It's more about fetch than decode (but decode is a nice bonus.)

It means that you can potentially deliver uops from a second taken branch immediately without waiting to redirect fetch. That is useful. Not groundbreaking necessarily, but useful.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
It's more about fetch than decode (but decode is a nice bonus.)

It means that you can potentially deliver uops from a second taken branch immediately without waiting to redirect fetch. That is useful. Not groundbreaking necessarily, but useful.

I thought that fetching uops has to happen in program order. The second taken branch can fetch instruction data and decode it, but uop fetch has to happen in program order, so decoded instructions from the two decoders are combined in program order before being fetched.
 

gdansk

Platinum Member
Feb 8, 2011
2,833
4,210
136
I have seen multiple people ask why not decode both paths of a branch, but I think even the simplest loop that executes, say, a million times shows why that is a bad idea.

The not-taken path pays off once, and you've wasted resources the other 999,999 times.
 

naukkis

Senior member
Jun 5, 2002
869
733
136
I have seen multiple people ask why not decode both paths of a branch, but I think even the simplest loop that executes, say, a million times shows why that is a bad idea.

We don't know how AMD has implemented their dual decoding. They might very well decode both paths in cases where they don't accurately know which branch will be taken. In that case, max decode width is still 4 for a single thread; the performance increase comes from situations where prediction accuracy is low and they can recover from a misprediction faster.
 

gdansk

Platinum Member
Feb 8, 2011
2,833
4,210
136
We don't know how AMD has implemented their dual decoding. They might very well decode both paths in cases where they don't accurately know which branch will be taken. In that case, max decode width is still 4 for a single thread; the performance increase comes from situations where prediction accuracy is low and they can recover from a misprediction faster.
Yes, if they are decoding both paths, it has to be with a very specific prediction history. In the general case, pursuing both branches is a bad idea.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,034
136
I thought that fetching uops has to happen in program order. The second taken branch can fetch instruction data and decode it, but uop fetch has to happen in program order, so decoded instructions from the two decoders are combined in program order before being fetched.

I said deliver immediately, not that they could be intermixed (though I'd be curious whether there's any mechanism to do so, with tagging by PC; it's certainly not impossible and would allow early execution of uops after the second branch.)

But also, trivial scenarios like this stand to benefit, because you can start fetch/decode of doStuff in the same cycle as you predict someLikelyCondition:

if (someLikelyCondition)
    if (someOtherLikelyCondition)
        doStuff;

Or, another example:

if (someLikelyCondition)
    doStuff;
doMoreStuff; // as soon as this is dispatched, uops from doEvenMoreStuff can start to be dispatched, as they've been fetched/decoded in parallel
if (someOtherLikelyCondition)
    doEvenMoreStuff; // no bubble for fetch/decode - its fetch/decode started at the same time as doStuff's did
 

Timmah!

Golden Member
Jul 24, 2010
1,510
824
136

Getting this result on an air cooler is impressive. 13th/14th gen needs an AIO
If this were out-of-the-box performance, or at the very least something achievable by a simple action like enabling PBO (one that would not in turn decrease single-core clocks and performance, as was the case with the 7000 series), not requiring meddling around with curves, voltages and whatnot, then I guess it would be something to consider after all. How much do you reckon one could sell a 7950X for?
 

desrever

Member
Nov 6, 2021
166
444
106
Power/perf scaling according to NBC.

So this means the H370x is more than double the performance of the 155H at 15W, and probably almost double the 185H too. Good luck to Lunar Lake catching up with its advertised 50% improvement.

Really funny that people here are acting like these aren't the best mobile CPUs outside of Apple.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,759
1,455
136
to save area

More CCXes -> more logic, more Infinity Fabric, more cache, more area.

Maybe it makes the layout fit together more nicely without needing to refactor other structures to pack things in more compactly (AMD has in the past taken the L on area rather than refactoring).
 