Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

Mahboi · Aug 19, 2024

MS_AT said:
Although a warning I am unable to tell how approachable they are if you don't have already some background in software.

Aw come on I have 0 formal education in electronics and just learned from Discord servers and listening in here. ^^
Even if I had zero background in software I could probably swim through (and I have quite a lot of software bg).
I'll try them thanks.

DavidC1 · Aug 19, 2024

MS_AT said:
I think C&C showed that they decayed to 3 wide decode [so using one cluster] in absence of sufficient number of branches, so in a sense what Zen 5 is doing But Skymont is supposed to be in a league of its own when it comes to clustered decode. It's probably the most interesting piece of the Arrow Lake puzzle.

Zen 5's bonus output for the clusters is zero in ST, contrary to Tremont.

Hotrod2go · Aug 19, 2024

igor_kavinski said:
Yeah but this is the first confusing generation of Zen.

Zen and Zen+ I don't know much about. BUT

Zen 2 > Zen 3 (Wow!)

Zen 3 -> Zen 4 (Impressive!)

Zen 4 -> Zen 5 (WTF???)

Compare Zen 4 single CCD to Zen 5 single CCD chips in Aida64. I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. I compare 7600X to 9700X is especially noticed at level 3 cache but both chips have same size level 3.
Level 1 size improvement in Zen 5 is most welcome as well so specifically from a gaming point of view even with out X3D, there is more overall total cache in Zen 5.

JustViewing · Aug 20, 2024

DavidC1 said:
Yup, you are correct. Scalar workloads are the most important. And it's the hardest to improve since that's the definition of general purpose CPU. It means you have to improve all without losing performance in what's already out there.

Scalar can be General purpose or SSE/AVX. The main limitation for reaching higher IPC in general purpose ALUs are the 16 general purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in Assembly. More read ports will help, but that also could increase the power usage, limiting the clock frequency.

Thunder 57 · Aug 20, 2024

JustViewing said:
Scalar can be General purpose or SSE/AVX. The main limitation for reaching higher IPC in general purpose ALUs are the 16 general purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in Assembly. More read ports will help, but that also could increase the power usage, limiting the clock frequency.

Will APX help allievate this with more registers? I haven't heard much of anything about it since it was announced.

JustViewing · Aug 20, 2024

Thunder 57 said:
Will APX help allievate this with more registers? I haven't heard much of anything about it since it was announced.

APX would open up the architecture for more IPC. But future CPUs have be to designed to take advantage of it. In my opinion, only after introduction of APX we could see a massive increase in general purpose IPC.

JustViewing · Aug 20, 2024

Because of AMD's split FPU design, in the past it was advantages to utilize both General Purpose and SSE/AVX unit for maximum scalar throughput (it could be still true compared with intel big core). It would hurt Intel performance. So I don't think any compilers used like that. However, you could write a benchmark in assembly to hurt Intel using both sets of execution units.

FlameTail · Aug 20, 2024

JustViewing said:
Scalar can be General purpose or SSE/AVX. The main limitation for reaching higher IPC in general purpose ALUs are the 16 general purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in Assembly. More read ports will help, but that also could increase the power usage, limiting the clock frequency.

So ARM has an advantage because it already has 32 GPRs?

del42sa · Aug 20, 2024

JustViewing said:
Because of AMD's split FPU design, in the past it was advantages to utilize both General Purpose and SSE/AVX unit for maximum scalar throughput (it could be still true compared with intel big core). It would hurt Intel performance. So I don't think any compilers used like that. However, you could write a benchmark in assembly to hurt Intel using both sets of execution units.

actually Lion Cove split FPU with the same fashion as AMD

JustViewing · Aug 20, 2024

FlameTail said:
So ARM has an advantage because it already has 32 GPRs?

Sort of, that one and fixed length instruction allows ARM higher IPC.

Bigos · Aug 20, 2024

JustViewing said:
Scalar can be General purpose or SSE/AVX. The main limitation for reaching higher IPC in general purpose ALUs are the 16 general purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in Assembly. More read ports will help, but that also could increase the power usage, limiting the clock frequency.

You know register renaming exists and allows one to improve parallelism with limited number of registers, right?

If you state the main IPC blocker is the number of architectural registers, do you have a proof? The "just look at the assembly" is not one, 95%+ functions are not being executed enough to matter. Do you know of a study that analyzes hot functions in common software?

Also, I do not see any connection between the lack of addressable registers and the number of PRF read ports. Are you suddenly talking about renamed registers?

Lastly, Zen2 and Zen4 can actually rename memory locations into the PRF. This should reduce the cost of stack spills and fills (they still use load/store resources, though). I am not sure if Zen5 still knows this trick.

JustViewing · Aug 20, 2024

Bigos said:
You know register renaming exists and allows one to improve parallelism with limited number of registers, right?

If you state the main IPC blocker is the number of architectural registers, do you have a proof? The "just look at the assembly" is not one, 95%+ functions are not being executed enough to matter. Do you know of a study that analyzes hot functions in common software?

Also, I do not see any connection between the lack of addressable registers and the number of PRF read ports. Are you suddenly talking about renamed registers?

Lastly, Zen2 and Zen4 can actually rename memory locations into the PRF. This should reduce the cost of stack spills and fills (they still use load/store resources, though). I am not sure if Zen5 still knows this trick.

This is from my experience programming in Assembly language. Register renaming would not help if you need to keep the value in registers. Lack of registers means, difficult to keep dependencies in registers. You constantly need to store and reload the values from L1. That is why I said having more read/write ports will help a bit with possible increase of power. For AVX, 16AVX registers in not much of a problem (most of the time). However, for me the 16 general purpose registers are not enough. With more registers, most of the ALUs can be utilized at once.

MS_AT · Aug 20, 2024

Bigos said:
You know register renaming exists and allows one to improve parallelism with limited number of registers, right?

You confuse register renaming with architectural state that compiler has to preserve. Compiler doesn't know you have 220 register file for GPR. It knows it has 16 of them. Compiler will spill if it finds out you overuse them.

Bigos said:
Lastly, Zen2 and Zen4 can actually rename memory locations into the PRF. This should reduce the cost of stack spills and fills (they still use load/store resources, though). I am not sure if Zen5 still knows this trick.

Do you mean that facilities that normally spill (push and pop for GPR) will write to PRF and not directly to cache? Do you know if this is documented somewhere?

JustViewing said:
Scalar can be General purpose or SSE/AVX. The main limitation for reaching higher IPC in general purpose ALUs are the 16 general purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in Assembly.

Is there scalar int that can be handle via the SIMD unit? I recall only FP part.

Actually the architectural register breakdown looks like this:
x86 -> 8 32b General Purpose Registers (GPRs) [legacy 16b and 8b are aliased into the lower parts of 32b]
x64 -> 16 GPRs [legacy 32b,16b,8b are aliased into lower parts of first eight registers]
SSE/AVX/AVX2 -> 16 SIMD registers of width matching the instruction set. For 128b registers are aliased into lower parts of 256b registers if AVX is supported. [SSE is 8 wide for x86]
AVX512 -> 32 512b SIMD registers. 256/128b are aliased into lower parts of those 32 regs.
ARM Neon -> 16 128b SIMD registers.
ARM-v8/ARM-v9 -> 31 64b GPRs [32,16,8 are aliased into lower parts]

Now this write up is rather shallow, I excluded details that sometimes make the effective number of available registers smaller than it would seem. The takeaway is that ARM is enjoying the lead here but how much it matters depends on the code. Still I have run into places where 16 GPRs and 16 AVX2 registers were a limitation that due to frequent spilling was lowering performance.

JustViewing · Aug 20, 2024

MS_AT said:
Is there scalar int that can be handle via the SIMD unit? I recall only FP part.

You can load a single scalar Integer value and do the SIMD Integer operation on them. Then save the value as a single scalar Integer value. Yes it is an overkill, but can be done.

Bigos · Aug 20, 2024

MS_AT said:
You confuse register renaming with architectural state that compiler has to preserve. Compiler doesn't know you have 220 register file for GPR. It knows it has 16 of them. Compiler will spill if it finds out you overuse them.

I did not confuse anything. I just wanted to point out the same register can be used multiple times for different things to somehow alleviate the lack of architectural registers. I did not know if that concept was understood.

MS_AT said:
Do you mean that facilities that normally spill (push and pop for GPR) will write to PRF and not directly to cache? Do you know if this is documented somewhere?

Agner Fog at least has documented that in his "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers": https://agner.org/optimize/

24.18 Mirroring memory operands

The Zen 4 can mirror memory operands inside the CPU so that memory operands have no
latency at all in some cases. See page 236 for a detailed description of this feature in Zen 2.
This feature was introduced in Zen 2, but absent in Zen 3. Now it has returned in Zen 4 with
some improvements. It works now with addressing modes that have a pointer and a 32-bit
offset, where Zen 2 allowed only 8-bit offsets. Some cases of false dependence have also
been removed, but not all.

I understand it that such instructions will both write to PRF and to memory, so the number of store pipelines still matters.

MS_AT · Aug 20, 2024

JustViewing said:
You can load a single scalar Integer value and do the SIMD Integer operation on them. Then save the value as a single scalar Integer value. Yes it is an overkill, but can be done.

Ah true, I was wondering if there is a special instruction for that, but you don't need special instructions for everything

Bigos said:
Agner Fog at least has documented that in his "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers": https://agner.org/optimize/

Ah yes, I forgot about this one, thanks for sharing That is the last piece of non AMD coverage of Zen5 that we are still missing, Agner's update about it.

Bigos said:
I understand it that such instructions will both write to PRF and to memory, so the number of store pipelines still matters.

It seems to be the case, and it has some limitations that make me wonder if compilers are able to use it fully, but for someone writing assembly seems like a useful tool. Wish it could work with SIMD also

StefanR5R · Aug 20, 2024

Hotrod2go said:
I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. I compare 7600X to 9700X is especially noticed at level 3 cache but both chips have same size level 3.
Level 1 size improvement in Zen 5 is most welcome as well so specifically from a gaming point of view even with out X3D, there is more overall total cache in Zen 5.

These caches are indeed faster, and L1d has become bigger. (And it doesn't stop at these caches, as e.g. BTB is bigger, ITLB is bigger, µop cache is dual ported, ROB is larger...) *However*, bigger L1 and faster L1/2/3 are benefiting workloads more which hit L1/2/3 a lot, whereas the returns are diminishing in workloads which have a sizable amount of cache misses to begin with. This goes without saying but I am mentioning it because you brought up games. Many video games empirically benefit from increase of L3 cache size and from decrease of main memory latency. So they are more akin to the latter type of workload.

csbin · Aug 20, 2024

Wow 2.6-4.5

https://twitter.com/x/status/1825832221678383376

Mahboi · Aug 20, 2024

Excuse me lolwut?
4.5 boost EPYC?
Even on 3E that's insane. Actually it's N4P since at that frequency it def won't be Zen5c but full Zen 5.
Hoooooly.

inquiss · Aug 20, 2024

Mahboi said:
Excuse me lolwut?
4.5 boost EPYC?
Even on 3E that's insane. Actually it's N4P since at that frequency it def won't be Zen5c but full Zen 5.
Hoooooly.

Yeah Turin looks like it's going to be screaming fast. Right about the time that intel is losing it's cashflow and ability to discount and right as AMD ramps it's sales team to supply to commercial (non hyper scale). Wonder how many of the laid off intel sales team and will pick up with the right connections.

CouncilorIrissa · Aug 20, 2024

Mahboi said:
Excuse me lolwut?
4.5 boost EPYC?
Even on 3E that's insane. Actually it's N4P since at that frequency it def won't be Zen5c but full Zen 5.
Hoooooly.

It's N4P, 9655 has Z5 classic cores.
Wouldn't make much sense to clock Dense that high anyway.

Abwx · Aug 20, 2024

inquiss said:
Yeah Turin looks like it's going to be screaming fast. Right about the time that intel is losing it's cashflow and ability to discount and right as AMD ramps it's sales team to supply to commercial (non hyper scale). Wonder how many of the laid off intel sales team and will pick up with the right connections.

None of them, having worked at Intel s sales forces is the worst card to present at AMD recruitement offices.

An AMD rep said that he permanently had to correct some customers opinions that were polluted by Intel s sales men who were unrelentlessly badmouthing, creating lies and falsehood ad infinitum about AMD products to keep them from buying their hardware, they would be the worse hirings AMD could make, let them live their miserable destiny instead.

Mahboi · Aug 20, 2024

For comparison: https://www.amd.com/products/proces...ation-9004-and-8004-series/amd-epyc-9654.html (hopefully that link works, I keep getting redirected to amd.com/fr)
Boost maximal 3.7 GHz
All Core Boost Speed 3.55 GHz
Base clock 2.4 GHz

Arguably it's only 200Mhz above Z4 in base clocks but still...

MS_AT · Aug 20, 2024

Hotrod2go said:
Compare Zen 4 single CCD to Zen 5 single CCD chips in Aida64. I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. I compare 7600X to 9700X is especially noticed at level 3 cache but both chips have same size level 3.
Level 1 size improvement in Zen 5 is most welcome as well so specifically from a gaming point of view even with out X3D, there is more overall total cache in Zen 5.

You can load from L1 128B/c with AVX512, 64B/c with AVX2 and 32B/c with pure scalar code. For Zen4 it was respectively, 64B/c, 64B/c, 24B/c.
Now about AIDA it's testing aggregate bandwidth so comparing 6 cores to 8 cores will leave you with false assumptions as 8 core will most likely show greater score due to having more cores.
Games won't see such dramatic increase in BW as AIDA might let you believe as they are rarely using SIMD and none of them are using AVX512. For games latency improvements are more welcome.

mostwanted002 · Aug 20, 2024

https://twitter.com/x/status/1825859218802618610

Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Golden Member

Senior member

Senior member

Senior member

Senior member

Platinum Member

Senior member

Senior member

Diamond Member

Member

Senior member

Member

Senior member

Member

Senior member

Member

Member

Elite Member

Senior member

Senior member

Member

Senior member

Lifer

Senior member

Member

Member