Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 711 - AnandTech Forums

StefanR5R

Elite Member
Dec 10, 2016
5,892
8,763
136
Off topic,
Surely, you’ve noticed that multiple countries have begun spending hundreds of billions of dollars to ensure that they have angstrom scale fabs inside their borders. Each of those countries will massively increase spending and several other nations will join them.
No, spending on this will at best stagnate. They will all have more pressing issues to solve, even if we consider only their supply chain issues and ignore everything else that's mounting.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
I am afraid it's a bad marketing message and miscommunication. The materials mentioned that the decoders are statically partitioned in SMT mode. Traditionally, when you wanted to turn off SMT, you went into the BIOS and disabled it. The question now is whether the SMT mode is static when enabled [i.e., if SMT is on in the BIOS, is the core always in SMT mode?] or dynamic, as the interviews are leading us to believe.

From a technical perspective, for the Strix mobile parts, the front end could be curtailed in the interest of efficiency if the memory subsystem cannot sustain the required bandwidth (e.g. smaller L3, slower memory).
If true, that really supports the argument that the fabric and the memory/cache hierarchy from L2 onwards need a big improvement.

As David already mentioned, there are changes in front-end behavior across the BIOSes.
It can be seen that AMD intentionally limited the performance of the processor front end and some instruction combinations in the previous microcode.

We will see if this is the case with the DT parts.
 

CouncilorIrissa

Senior member
Jul 28, 2023
522
2,003
96
From a technical perspective, for the Strix mobile parts, the front end could be curtailed in the interest of efficiency if the memory subsystem cannot sustain the required bandwidth (e.g. smaller L3, slower memory).
If true, that really supports the argument that the fabric and the memory/cache hierarchy from L2 onwards need a big improvement.
The classic cores in Strix have the same amount of L3 per core as desktop tho, 4MB.

Seriously though, is Zen "5.5" one of the targets they are working on until the launch of Turin? Which, at this time, is still communicated as "in 2H 2024" AFAIK. (3,638 hours left.)
You think they'd release the client lineup without enabling all of the core's features because they aren't ready? Seems highly unlikely, they'd rather postpone the launch I think. They don't need early reviews to be less positive than they could be.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,966
136
I am a straight-up bubble huffer. I say that AI functions will be the main reason for sales of computing devices by 2040. I still find it bizarre that so many people on various text-site forums don’t see that AI is THE future of computing in society.

I am also broadly optimistic on AI. We are definitely currently in a bubble and it will eventually burst, but AI will not die then and there will be new growth after that.

I am not optimistic on edge AI on client devices. The current offerings are clearly limited, and the only way we know to improve is to grow the models. So compute demand will grow with supply, and offerings stuck at a lower performance level will likely lose in the market because they will simply be worse.

And when you want to scale performance up, client has the unfixable problem of being less efficient than centralized. Because utilization will be lower (usage by a single human is bursty), and because the only way to get reasonable cache locality is batching. Meaning that it will simply cost a lot less to provide a service with equivalent quality in a centralized system with fairly thin clients, than it costs to put the AI on the client. And this will only ever end when AI is well more than "good enough", which will probably take decades.

Yes, there is less privacy on centralized systems. But most people are still using facebook and tiktok, so they clearly don't care.
 

MS_AT

Senior member
Jul 15, 2024
210
507
96
Hmm why does Zen5C have higher FP IPC than vanilla Zen5 ? (performance/GHz)

[Attachment 104342: Zen 5 vs. Zen 5c performance/GHz chart]

Something's fishy with the clock speeds, I guess.
To do a per-core test, you need to pin the test to the core. The thing is, they are running under WSL2, so they are running Linux inside a Hyper-V virtual machine [that's what WSL2 is], and they can pin all they want; the hypervisor won't care... https://github.com/Microsoft/WSL/issues/3827 unless M$ fixed that, but the issue is still open on GitHub. In other words, they may think they are running the test on the Z5c core, but in reality it might be running on Z5, on Z5c, or on both of them.

At least to the best of my knowledge. If there exists a way to reliably pin a workload to a core in that setup, I would be grateful if somebody could share it.
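For what it's worth, on native Linux (outside WSL2) pinning itself is straightforward; the sketch below shows it with the standard library, with the caveat from the linked issue that inside a WSL2 guest the mask is only honored by the guest kernel, while Hyper-V may still migrate the vCPU. The choice of logical CPU 0 is an assumption; check `lscpu` for the real topology.

```python
import os

# Pin the current process to logical CPU 0 (the CPU number is an
# assumption -- map it to the intended core via lscpu or /proc/cpuinfo).
os.sched_setaffinity(0, {0})

# Verify the kernel accepted the mask. On native Linux this is
# authoritative; under WSL2 the guest kernel reports the mask, but the
# Hyper-V hypervisor may still schedule the vCPU on any physical core.
mask = os.sched_getaffinity(0)
print("pinned to logical CPUs:", mask)
```

Equivalently, `taskset -c 0 ./benchmark` from the shell applies the same mask, with the same WSL2 caveat.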
 

yottabit

Golden Member
Jun 5, 2008
1,485
514
146
I really thought Strix Halo would have on-package memory for some reason. Hmm

The idea of having to actually disable SMT to get a 1t uplift on say a 9950x doesn’t seem too bad, I mean that would work well for my use cases at least

However, I’m highly skeptical of something like that existing; if the tech for dynamic allocation of those decode units isn’t there, I doubt disabling SMT would magically enable them to be shared on a core. We haven’t seen any evidence of that yet, right?
 

LightningZ71

Golden Member
Mar 10, 2017
1,785
2,139
136
With Zen5c-256 being denser and targeted at lower clock speeds, are they achieving lower internal cache access latencies as compared to full Zen5-256? Those latencies can be important for Spec as most of it largely runs in cache. If they fine tuned it further, they might also be achieving slightly better latencies on certain complex instructions as well.

Has anyone done a full instruction latency profile on Zen5 vs Zen5c?
 

StefanR5R

Elite Member
Dec 10, 2016
5,892
8,763
136
Hmm why does Zen5C have higher FP IPC than vanilla Zen5 ? (performance/GHz)
Remember what "IPC" is.
– Actual meaning of the TLA: instructions per clock
– Terrible misuse of the TLA: iso-clock performance
– Plan 9 From Outer Space level abuse of the TLA: clock-normalized performance
The diagram labels indicate which one of these three is shown.
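To make the distinction concrete, here is a toy calculation with made-up numbers (none of these are measured Zen 5 / Zen 5c values) showing how the three readings of "IPC" can disagree:

```python
# 1) Actual IPC: retired instructions divided by core clock cycles,
#    straight from performance counters.
instructions = 8.0e9
cycles = 2.0e9
ipc = instructions / cycles            # 4.0 instructions per clock

# 2) Iso-clock performance: lock both cores to the same frequency
#    and compare benchmark scores directly.
score_a_at_3ghz = 120.0
score_b_at_3ghz = 126.0
iso_clock_ratio = score_b_at_3ghz / score_a_at_3ghz   # 1.05

# 3) Clock-normalized performance: divide each score by whatever average
#    clock the run happened to sustain -- this is what such charts usually
#    plot, and boost behavior can skew it either way.
score_a, ghz_a = 150.0, 5.0
score_b, ghz_b = 126.0, 3.3
norm_a = score_a / ghz_a               # 30.0 "points per GHz"
norm_b = score_b / ghz_b               # ~38.2 -- the lower-clocked core
                                       # "wins" without retiring more
                                       # instructions per cycle

print(ipc, iso_clock_ratio, norm_a, round(norm_b, 1))
```

So a chart labeled "performance/GHz" can show the dense core ahead even if its true instructions-per-clock figure is identical.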
 

LightningZ71

Golden Member
Mar 10, 2017
1,785
2,139
136
Latency isn't just a function of cache density. Latencies are dialed in to account for worst case transistor response and transmission times. With lower target clocks, you need to wait fewer clock cycles to cover the same actual time delay, enabling the reduction in clock cycle latencies designed into the hardware.

IIRC, one of the things we see in Apple's lower-clocked cores is large caches with lower cycle-count latencies. This is one way they extract higher average IPC from their designs.

I've no idea if AMD tried any of this, or simply left Zen5c as a denser Zen5 with no other notable changes. Adjusting these latencies is absolutely not as trivial as it might seem.
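The clocks-vs-cycles point above can be sketched numerically. The delay and frequencies below are invented for illustration, not actual Zen 5 / Zen 5c figures:

```python
import math

def cycles_for_delay(delay_ns: float, freq_ghz: float) -> int:
    """Smallest whole number of clock cycles covering delay_ns at freq_ghz."""
    return math.ceil(delay_ns * freq_ghz)  # GHz * ns = cycles

# A fixed physical delay through the same silicon...
wire_delay_ns = 0.9   # assumed worst-case cache access path delay

# ...costs fewer cycles at a lower target clock, so a lower-clocked core
# can legitimately advertise a lower cycle-count latency.
print(cycles_for_delay(wire_delay_ns, 5.7))  # 6 cycles at 5.7 GHz
print(cycles_for_delay(wire_delay_ns, 3.3))  # 3 cycles at 3.3 GHz
```

The same 0.9 ns path needs six cycles at 5.7 GHz but only three at 3.3 GHz, which is the headroom a lower-clocked design can bank as reduced cycle latency.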
 

StefanR5R

Elite Member
Dec 10, 2016
5,892
8,763
136
I've no idea if AMD tried any of this, or simply left Zen5c as a denser Zen5 with no other notable changes.
There are at least AnandTech's core-to-core latency measurements, in which latency (in nanoseconds, not necessarily in cycle counts) between dense cluster members is larger than between classic cluster members. Granted, the classic cluster has got only half the number of cores attached to its internal bus as the dense cluster has, IOW the classic cluster's bus is topologically smaller.

However, these measurement results are weird (to me) as they show thread siblings having the same CPU-to-CPU latency as L3$ siblings for HX 370 (in contrast to 7940HS, where thread siblings reach each other considerably quicker).
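For anyone curious what such a core-to-core measurement looks like in principle, here is a heavily simplified ping-pong sketch: two threads hand a token back and forth and the round trip is averaged. Real tools pin each thread to a specific core and bounce a cache line with atomic operations; Python's Event objects add scheduler and GIL overhead, so the absolute number is illustrative only, not comparable to AnandTech's figures.

```python
import threading
import time

ROUNDS = 10_000
ping, pong = threading.Event(), threading.Event()

def responder() -> None:
    # Wait for the "ping" token, hand back a "pong", repeat.
    for _ in range(ROUNDS):
        ping.wait()
        ping.clear()
        pong.set()

t = threading.Thread(target=responder)
t.start()

start = time.perf_counter()
for _ in range(ROUNDS):
    ping.set()        # send token to the other thread
    pong.wait()       # wait for it to come back
    pong.clear()
t.join()
elapsed = time.perf_counter() - start

print(f"avg round trip: {elapsed / ROUNDS * 1e9:.0f} ns")
```

Pairing this loop with per-thread affinity (as in the pinning example discussed earlier in the thread) for every (core A, core B) combination is roughly how a core-to-core latency matrix is built.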
 