Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Mahboi

Senior member
Apr 4, 2024
962
1,742
96
Although, a warning: I am unable to tell how approachable they are if you don't already have some background in software.
Aw come on I have 0 formal education in electronics and just learned from Discord servers and listening in here. ^^
Even if I had zero background in software I could probably swim through (and I have quite a lot of software bg).
I'll try them, thanks.
 
Reactions: MS_AT

DavidC1

Senior member
Dec 29, 2023
776
1,231
96
I think C&C showed that they decayed to 3-wide decode [so using one cluster] in the absence of a sufficient number of branches, so in a sense that's what Zen 5 is doing. But Skymont is supposed to be in a league of its own when it comes to clustered decode. It's probably the most interesting piece of the Arrow Lake puzzle.
Zen 5's bonus output for the clusters is zero in ST, contrary to Tremont.
 

Hotrod2go

Senior member
Nov 17, 2021
349
232
86
Yeah but this is the first confusing generation of Zen.

Zen and Zen+ I don't know much about. BUT

Zen 2 -> Zen 3 (Wow!)

Zen 3 -> Zen 4 (Impressive!)

Zen 4 -> Zen 5 (WTF???)
Compare Zen 4 single-CCD to Zen 5 single-CCD chips in AIDA64. I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. Comparing the 7600X to the 9700X, it is especially noticeable at the level 3 cache, even though both chips have the same size level 3.
The level 1 size improvement in Zen 5 is most welcome as well, so specifically from a gaming point of view, even without X3D, there is more total cache overall in Zen 5.
 

JustViewing

Senior member
Aug 17, 2022
216
381
106
Yup, you are correct. Scalar workloads are the most important. And they're the hardest to improve, since that's the definition of a general-purpose CPU. It means you have to improve everything without losing performance in what's already out there.
Scalar can be general purpose or SSE/AVX. The main limitation to reaching higher IPC in the general purpose ALUs is the 16 general purpose registers. Because of this limitation, there is a constant need to access L1. It is easy to see this limitation when you are programming directly in assembly. More read ports would help, but that could also increase power usage, limiting the clock frequency.
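To make that spill pressure concrete, here is a minimal C sketch (my own hypothetical example, not from the thread): with more independent live values than x86-64's 16 GPRs, the compiler has to park some of them in stack slots and touch L1 on every iteration.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: 18 accumulators plus the pointer and the loop
   counter exceed the ~15 usable x86-64 GPRs, so when built as scalar
   code (e.g. gcc -O2 -fno-tree-vectorize) the compiler spills some of
   the accumulators to the stack and reloads them from L1 every pass. */
uint64_t many_live_accumulators(const uint64_t *p, size_t n)
{
    uint64_t a00 = 0, a01 = 0, a02 = 0, a03 = 0, a04 = 0, a05 = 0;
    uint64_t a06 = 0, a07 = 0, a08 = 0, a09 = 0, a10 = 0, a11 = 0;
    uint64_t a12 = 0, a13 = 0, a14 = 0, a15 = 0, a16 = 0, a17 = 0;

    for (size_t i = 0; i + 18 <= n; i += 18) {
        a00 += p[i];      a01 += p[i + 1];  a02 += p[i + 2];
        a03 += p[i + 3];  a04 += p[i + 4];  a05 += p[i + 5];
        a06 += p[i + 6];  a07 += p[i + 7];  a08 += p[i + 8];
        a09 += p[i + 9];  a10 += p[i + 10]; a11 += p[i + 11];
        a12 += p[i + 12]; a13 += p[i + 13]; a14 += p[i + 14];
        a15 += p[i + 15]; a16 += p[i + 16]; a17 += p[i + 17];
    }
    return a00 + a01 + a02 + a03 + a04 + a05 + a06 + a07 + a08 +
           a09 + a10 + a11 + a12 + a13 + a14 + a15 + a16 + a17;
}

With 32 architectural GPRs the same accumulators would fit entirely in registers, which is the point being made here.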
 

Thunder 57

Platinum Member
Aug 19, 2007
2,945
4,467
136
Scalar can be general purpose or SSE/AVX. The main limitation to reaching higher IPC in the general purpose ALUs is the 16 general purpose registers. Because of this limitation, there is a constant need to access L1. It is easy to see this limitation when you are programming directly in assembly. More read ports would help, but that could also increase power usage, limiting the clock frequency.

Will APX help alleviate this with more registers? I haven't heard much of anything about it since it was announced.
 

JustViewing

Senior member
Aug 17, 2022
216
381
106
Because of AMD's split FPU design, in the past it was advantageous to utilize both the general purpose and the SSE/AVX units for maximum scalar throughput (it could still be true compared with Intel's big cores). It would hurt Intel performance, so I don't think any compilers generated code like that. However, you could write a benchmark in assembly that hurts Intel by using both sets of execution units.
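For illustration, a rough hypothetical sketch of what "using both sets of execution units" could look like for scalar-ish integer work (hand-written; a compiler may well reshape it): half of the additions go through the GPR ALUs and half through the SSE2 integer ALUs in the same loop.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixed loop: two elements per iteration are summed with
   scalar adds (integer ALUs) and two with an SSE2 vector add (SIMD
   ALUs), so both halves of a split design can be kept busy at once. */
uint64_t mixed_sum(const uint64_t *p, size_t n)   /* n: multiple of 4 */
{
    uint64_t scalar_acc = 0;
    __m128i  simd_acc   = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 4) {
        scalar_acc += p[i];
        scalar_acc += p[i + 1];
        simd_acc = _mm_add_epi64(simd_acc,
                   _mm_loadu_si128((const __m128i *)(p + i + 2)));
    }

    /* fold the SIMD accumulator back into the scalar result */
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, simd_acc);
    return scalar_acc + lanes[0] + lanes[1];
}

Whether this actually wins on a given core depends on its port layout and on how the loop ends up scheduled.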
 

FlameTail

Diamond Member
Dec 15, 2021
3,748
2,191
106
Scalar can be general purpose or SSE/AVX. The main limitation to reaching higher IPC in the general purpose ALUs is the 16 general purpose registers. Because of this limitation, there is a constant need to access L1. It is easy to see this limitation when you are programming directly in assembly. More read ports would help, but that could also increase power usage, limiting the clock frequency.
So ARM has an advantage because it already has 32 GPRs?
 

del42sa

Member
May 28, 2013
99
115
106
Because of AMD's split FPU design, in the past it was advantageous to utilize both the general purpose and the SSE/AVX units for maximum scalar throughput (it could still be true compared with Intel's big cores). It would hurt Intel performance, so I don't think any compilers generated code like that. However, you could write a benchmark in assembly that hurts Intel by using both sets of execution units.
Actually, Lion Cove splits the FPU in the same fashion as AMD.
 
Reactions: lightmanek

Bigos

Member
Jun 2, 2019
148
357
136
Scalar can be general purpose or SSE/AVX. The main limitation to reaching higher IPC in the general purpose ALUs is the 16 general purpose registers. Because of this limitation, there is a constant need to access L1. It is easy to see this limitation when you are programming directly in assembly. More read ports would help, but that could also increase power usage, limiting the clock frequency.
You know register renaming exists and allows one to improve parallelism with limited number of registers, right?

If you state the main IPC blocker is the number of architectural registers, do you have proof? "Just look at the assembly" is not proof; 95%+ of functions are not executed enough to matter. Do you know of a study that analyzes hot functions in common software?

Also, I do not see any connection between the lack of addressable registers and the number of PRF read ports. Are you suddenly talking about renamed registers?

Lastly, Zen2 and Zen4 can actually rename memory locations into the PRF. This should reduce the cost of stack spills and fills (they still use load/store resources, though). I am not sure if Zen5 still knows this trick.
 
Reactions: Thunder 57

JustViewing

Senior member
Aug 17, 2022
216
381
106
You know register renaming exists and allows one to improve parallelism with limited number of registers, right?

If you state the main IPC blocker is the number of architectural registers, do you have proof? "Just look at the assembly" is not proof; 95%+ of functions are not executed enough to matter. Do you know of a study that analyzes hot functions in common software?

Also, I do not see any connection between the lack of addressable registers and the number of PRF read ports. Are you suddenly talking about renamed registers?

Lastly, Zen2 and Zen4 can actually rename memory locations into the PRF. This should reduce the cost of stack spills and fills (they still use load/store resources, though). I am not sure if Zen5 still knows this trick.
This is from my experience programming in assembly language. Register renaming does not help if you need to keep the value in registers. A lack of registers means it is difficult to keep dependencies in registers; you constantly need to store and reload values from L1. That is why I said having more read/write ports would help a bit, with a possible increase in power. For AVX, 16 AVX registers is not much of a problem (most of the time). However, for me the 16 general purpose registers are not enough. With more registers, most of the ALUs can be utilized at once.
 

MS_AT

Member
Jul 15, 2024
192
424
91
You know register renaming exists and allows one to improve parallelism with limited number of registers, right?
You confuse register renaming with the architectural state the compiler has to preserve. The compiler doesn't know you have a 220-entry GPR register file; it knows it has 16 registers, and it will spill if it finds it is overusing them.
Lastly, Zen2 and Zen4 can actually rename memory locations into the PRF. This should reduce the cost of stack spills and fills (they still use load/store resources, though). I am not sure if Zen5 still knows this trick.
Do you mean that facilities that normally spill (push and pop for GPR) will write to PRF and not directly to cache? Do you know if this is documented somewhere?
Scalar can be general purpose or SSE/AVX. The main limitation to reaching higher IPC in the general purpose ALUs is the 16 general purpose registers. Because of this limitation, there is a constant need to access L1. It is easy to see this limitation when you are programming directly in assembly.
Is there scalar int that can be handled via the SIMD unit? I recall only the FP part.

Actually the architectural register breakdown looks like this:
x86 -> 8 32b General Purpose Registers (GPRs) [legacy 16b and 8b are aliased into the lower parts of 32b]
x64 -> 16 GPRs [legacy 32b,16b,8b are aliased into lower parts of first eight registers]
SSE/AVX/AVX2 -> 16 SIMD registers, of width matching the instruction set. The 128b registers are aliased into the lower parts of the 256b registers if AVX is supported. [SSE has only 8 registers in x86 (32-bit) mode]
AVX512 -> 32 512b SIMD registers. 256/128b are aliased into lower parts of those 32 regs.
ARM Neon -> 16 128b SIMD registers.
ARM-v8/ARM-v9 -> 31 64b GPRs [32,16,8 are aliased into lower parts]

Now, this write-up is rather shallow; I excluded details that sometimes make the effective number of available registers smaller than it would seem. The takeaway is that ARM enjoys the lead here, but how much it matters depends on the code. Still, I have run into places where 16 GPRs and 16 AVX2 registers were a limitation, where frequent spilling was lowering performance.
 

Bigos

Member
Jun 2, 2019
148
357
136
You confuse register renaming with the architectural state the compiler has to preserve. The compiler doesn't know you have a 220-entry GPR register file; it knows it has 16 registers, and it will spill if it finds it is overusing them.

I did not confuse anything. I just wanted to point out the same register can be used multiple times for different things to somehow alleviate the lack of architectural registers. I did not know if that concept was understood.

Do you mean that facilities that normally spill (push and pop for GPR) will write to PRF and not directly to cache? Do you know if this is documented somewhere?

Agner Fog at least has documented that in his "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers": https://agner.org/optimize/

24.18 Mirroring memory operands

The Zen 4 can mirror memory operands inside the CPU so that memory operands have no latency at all in some cases. See page 236 for a detailed description of this feature in Zen 2. This feature was introduced in Zen 2, but absent in Zen 3. Now it has returned in Zen 4 with some improvements. It works now with addressing modes that have a pointer and a 32-bit offset, where Zen 2 allowed only 8-bit offsets. Some cases of false dependence have also been removed, but not all.

My understanding is that such instructions write both to the PRF and to memory, so the number of store pipelines still matters.
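For anyone wondering what such a spill/fill looks like, here is a tiny hypothetical C fragment (my example, not Agner's) whose compiled form contains the kind of pattern the feature targets; "opaque" is an assumed external function.

/* Hypothetical example: because the compiler cannot see into opaque(),
   the live value v typically survives the call via the stack, either
   spilled directly or held in a callee-saved register that is pushed
   in the prologue and popped in the epilogue. Either way the machine
   code contains a store to a [pointer + offset] stack slot and a later
   load of the same slot, which is the store/reload pair that Zen 2 /
   Zen 4 can satisfy from the mirrored register copy instead of the
   store-to-load forwarding path. */
extern void opaque(void);

long keep_across_call(long v)
{
    opaque();      /* v must survive this call                        */
    return v * 7;  /* the save/restore around it is the store + load
                      pair that mirroring can short-circuit           */
}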
 

MS_AT

Member
Jul 15, 2024
192
424
91
You can load a single scalar integer value and do the SIMD integer operation on it, then save the result as a single scalar integer value. Yes, it is overkill, but it can be done.
Ah true, I was wondering if there is a special instruction for that, but you don't need special instructions for everything.
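A minimal hypothetical sketch of that round trip with SSE2 intrinsics (the function name is made up):

#include <emmintrin.h>
#include <stdint.h>

/* Move a scalar 32-bit integer into the low lane of an XMM register,
   add it on the SIMD integer ALU, and move the result back to a GPR.
   Overkill for one add, but it shows scalar int work can be routed
   through the SIMD unit. */
int32_t scalar_add_via_simd(int32_t a, int32_t b)
{
    __m128i va = _mm_cvtsi32_si128(a);    /* GPR -> low lane of xmm */
    __m128i vb = _mm_cvtsi32_si128(b);
    __m128i vs = _mm_add_epi32(va, vb);   /* SIMD integer add       */
    return _mm_cvtsi128_si32(vs);         /* low lane -> GPR        */
}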
Agner Fog at least has documented that in his "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers": https://agner.org/optimize/
Ah yes, I forgot about this one, thanks for sharing. That is the last piece of non-AMD coverage of Zen 5 that we are still missing, Agner's update about it.
My understanding is that such instructions write both to the PRF and to memory, so the number of store pipelines still matters.
It seems to be the case, and it has some limitations that make me wonder if compilers are able to use it fully, but for someone writing assembly it seems like a useful tool. I wish it could work with SIMD also.
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,746
136
I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. Comparing the 7600X to the 9700X, it is especially noticeable at the level 3 cache, even though both chips have the same size level 3.
The level 1 size improvement in Zen 5 is most welcome as well, so specifically from a gaming point of view, even without X3D, there is more total cache overall in Zen 5.
These caches are indeed faster, and L1d has become bigger. (And it doesn't stop at these caches: e.g. the BTB is bigger, the ITLB is bigger, the µop cache is dual ported, the ROB is larger...) *However*, bigger L1 and faster L1/L2/L3 mostly benefit workloads which hit L1/L2/L3 a lot, whereas the returns are diminishing in workloads which have a sizable amount of cache misses to begin with. This goes without saying, but I am mentioning it because you brought up games. Many video games empirically benefit from an increase in L3 cache size and from a decrease in main memory latency, so they are more akin to the latter type of workload.
 

Mahboi

Senior member
Apr 4, 2024
962
1,742
96
Excuse me lolwut?
4.5 boost EPYC?
Even on 3E that's insane. Actually it's N4P since at that frequency it def won't be Zen5c but full Zen 5.
Hoooooly.
 

inquiss

Member
Oct 13, 2010
175
260
136
Excuse me lolwut?
4.5 boost EPYC?
Even on 3E that's insane. Actually it's N4P since at that frequency it def won't be Zen5c but full Zen 5.
Hoooooly.
Yeah, Turin looks like it's going to be screaming fast. Right about the time that Intel is losing its cashflow and ability to discount, and right as AMD ramps its sales team to supply commercial (non hyperscale) customers. Wonder how many of the laid-off Intel sales team AMD will pick up, with the right connections.
 

Abwx

Lifer
Apr 2, 2011
11,514
4,301
136
Yeah, Turin looks like it's going to be screaming fast. Right about the time that Intel is losing its cashflow and ability to discount, and right as AMD ramps its sales team to supply commercial (non hyperscale) customers. Wonder how many of the laid-off Intel sales team AMD will pick up, with the right connections.


None of them; having worked in Intel's sales force is the worst card to present at AMD's recruitment offices.

An AMD rep said that he constantly had to correct customers' opinions that had been polluted by Intel's salesmen, who were relentlessly badmouthing AMD and creating lies and falsehoods ad infinitum about AMD products to keep customers from buying the hardware. They would be the worst hirings AMD could make; let them live out their miserable destiny instead.
 

MS_AT

Member
Jul 15, 2024
192
424
91
Compare Zen 4 single-CCD to Zen 5 single-CCD chips in AIDA64. I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. Comparing the 7600X to the 9700X, it is especially noticeable at the level 3 cache, even though both chips have the same size level 3.
The level 1 size improvement in Zen 5 is most welcome as well, so specifically from a gaming point of view, even without X3D, there is more total cache overall in Zen 5.
You can load from L1 128 B/c with AVX-512, 64 B/c with AVX2 and 32 B/c with pure scalar code. For Zen 4 it was 64 B/c, 64 B/c and 24 B/c respectively.
Now, about AIDA: it measures aggregate bandwidth, so comparing 6 cores to 8 cores will leave you with false assumptions, as the 8-core part will most likely show a greater score simply by having more cores.
Games won't see as dramatic an increase in bandwidth as AIDA might lead you to believe, since they rarely use SIMD and none of them use AVX-512. For games, latency improvements are more welcome.
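To put those per-cycle numbers in context, here is a hypothetical L1-read loop (assumed code, not an actual AIDA kernel): the AVX-512 version issues two 64-byte loads per iteration, which is what it takes to approach 128 B/c on a core that can sustain two 512-bit loads per cycle; the AVX2 analogue moves at most 64 B/c, and a scalar loop far less.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bandwidth-style kernel: two 64 B (512-bit) loads per
   iteration feed two independent accumulators (summed as 64-bit lanes
   just to keep the loads live). */
uint64_t read_sum_avx512(const uint8_t *buf, size_t n) /* n: multiple of 128 */
{
    __m512i acc0 = _mm512_setzero_si512();
    __m512i acc1 = _mm512_setzero_si512();

    for (size_t i = 0; i < n; i += 128) {
        acc0 = _mm512_add_epi64(acc0, _mm512_loadu_si512(buf + i));
        acc1 = _mm512_add_epi64(acc1, _mm512_loadu_si512(buf + i + 64));
    }
    return _mm512_reduce_add_epi64(_mm512_add_epi64(acc0, acc1));
}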
 
Reactions: lightmanek