Going from a 1x9-throughput uop-cache to a 2x6-throughput uop-cache is also an SMT-minded decision, because viewed purely from an ST angle it would be a downgrade: a single thread drops from 9 uops/cycle to 6 unless it can draw from both pipes.
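To put rough numbers on that, here's a back-of-envelope model. The assumptions (each pipe serving one thread in SMT, and ST getting both pipes only if the fetch logic can combine them) are mine, not from any disclosed design:

```python
# Back-of-envelope per-thread uop-cache fetch bandwidth (uops/cycle).
# Assumption: in SMT each pipe serves one thread; in ST a thread gets one pipe
# unless the front end can combine both pipes for a single thread.

configs = {
    "1x9": {"pipes": 1, "width": 9},
    "2x6": {"pipes": 2, "width": 6},
}

for name, c in configs.items():
    smt_per_thread = (c["pipes"] * c["width"]) / 2   # two threads share the total bandwidth
    st_single_pipe = c["width"]                      # ST, no pipe combining
    st_combined = c["pipes"] * c["width"]            # ST, if both pipes can feed one thread
    print(f"{name}: SMT per-thread={smt_per_thread}, ST (1 pipe)={st_single_pipe}, ST (combined)={st_combined}")
```

So 2x6 beats 1x9 per thread under SMT (6 vs 4.5 uops/cycle) but loses in ST (6 vs 9) unless both pipes can feed one thread (12 vs 9), which is exactly why the 1T-mode behaviour discussed below matters.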
As far as I understand, entries in the uop-cache contain up to N consecutive instructions starting from a given address, but an entry may hold fewer instructions (see the toy sketch after this list) if:
- An instruction crosses a cacheline; or
- There is a branch in the instruction stream; or
- The current entry doesn't have enough space to hold an instruction that is decoded into multiple uops (probably rare, though).
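To make those termination conditions concrete, here's a toy model of how one entry might get filled. The 64-byte line size, the per-entry instruction and uop budgets, and all the names are my own assumptions, not anything from the actual design:

```python
# Toy model of filling one uop-cache entry. Assumptions (mine): 64-byte
# cachelines, an entry holds at most MAX_INSNS instructions and MAX_UOPS uops,
# and an entry ends at a branch.

MAX_INSNS = 6      # "up to N consecutive instructions"
MAX_UOPS = 8       # total uop budget per entry
LINE_SIZE = 64

def build_entry(insns, start_addr):
    """insns: list of (length_in_bytes, uop_count, is_branch) starting at start_addr.
    Returns how many instructions this entry captures."""
    addr, uops, count = start_addr, 0, 0
    line = start_addr // LINE_SIZE
    for length, uop_count, is_branch in insns:
        if count == MAX_INSNS:
            break                      # entry is full
        if (addr + length - 1) // LINE_SIZE != line:
            break                      # instruction crosses the cacheline
        if uops + uop_count > MAX_UOPS:
            break                      # no room for a multi-uop instruction
        count += 1
        uops += uop_count
        addr += length
        if is_branch:
            break                      # entry ends at the branch
    return count

# Example: the third instruction is a branch, so the entry holds 3 instructions.
print(build_entry([(4, 1, False), (3, 1, False), (2, 1, True), (5, 1, False)], 0x100))
```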
So if the new uop-cache can, when operating in 1T mode, retrieve two consecutive entries (or two entries across a branch) per cycle, then it's actually a win in every case. And that should be quite doable, since the BTB already holds the target address of the next branch.
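As a rough sketch of what that fetch chaining could look like, assuming the uop-cache is indexed by start address and the BTB supplies the taken target of the entry-ending branch (every name here is hypothetical):

```python
# Hypothetical 1T-mode fetch: deliver two uop-cache entries per cycle by
# chaining through either the fall-through address or the BTB-predicted target.
from collections import namedtuple

Entry = namedtuple("Entry", ["end_addr", "ends_in_branch"])

def fetch_two_entries(uop_cache, btb, pc):
    """uop_cache: dict start_addr -> Entry; btb: dict start_addr -> predicted target.
    Returns the entries delivered this cycle (up to two)."""
    delivered = []
    for _ in range(2):                    # two entries per cycle in 1T mode
        entry = uop_cache.get(pc)
        if entry is None:
            break                         # miss: fall back to the legacy decoders
        delivered.append(entry)
        if entry.ends_in_branch:
            target = btb.get(pc)          # BTB already has the next target
            if target is None:
                break                     # no prediction, stop the chain here
            pc = target
        else:
            pc = entry.end_addr           # consecutive entry: just fall through
    return delivered

# Example: first entry ends in a branch whose BTB target points at the second.
cache = {0x100: Entry(0x110, True), 0x200: Entry(0x210, False)}
btb = {0x100: 0x200}
print(len(fetch_two_entries(cache, btb, 0x100)))   # 2
```

In this sketch an entry-ending branch costs nothing extra as long as the BTB has a prediction ready; the chain only stops on a cache miss or an unpredicted branch.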
I agree that this probably has greater impact in SMT, though. I hope someone figures out the chicken bits for these features so we can eventually compare them in different workloads.