Discussion Zen 5 Architecture & Technical discussion


FlameTail

Diamond Member
Dec 15, 2021
3,852
2,295
106
Zen 5 was supposed to focus on efficiency as per FAD2022. Quite odd there was no mention of it.

Dual decode seems interesting, so Z5 would now be 8 wide (2x 4 wide), Tremont style.
That is very interesting indeed. Why do split decoders? What's the benefit? Why not a monolithic 8-wide decoder?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
That is very interesting indeed. Why do split decoders? What's the benefit? Why not a monolithic 8-wide decoder?
They can gate off the second decoder block when it isn't needed, for instance when there are no branch instructions / no multiple instruction streams.
However, the inner details could be different, and it's unlikely they can match the throughput of a single 8-wide decoder.
 

JustViewing

Senior member
Aug 17, 2022
217
383
106
That is very interesting indeed. Why do split decoders? What's the benefit? Why not a monolithic 8-wide decoder?
Dual decoders might benefit SMT. Since Intel is removing HT, this would give AMD a nice advantage in multithreaded workloads, especially in mobile. With the increased ALU count and a bigger dispatch window, Zen 5 could see a significant improvement in SMT performance relative to Zen 4.
 

StefanR5R

Elite Member
Dec 10, 2016
5,913
8,818
136
On post #1: FAD2022 material could have been included, as thin as the Computex material was.

On post #2: Caution! Post #1 forbids speculation here.
 
Jul 27, 2020
19,823
13,588
146
I got the sense that they didn't want Zen 5 getting too much attention because the star of the show was supposed to be AI. Will have to wait for reviews/CnC deep dive/HotChips for the juicy details.
 

DavidC1

Senior member
Dec 29, 2023
829
1,304
96
That is very interesting indeed. Why do split decoders? What's the benefit? Why not a monolithic 8-wide decoder?
Clustered decode, such as in Tremont, is an attempt to address the x86 vs ARM argument: variable-length instructions make x86 more complex to decode, requiring more transistors (thus more area and power). The relationship is said to be roughly quadratic, so the impact gets worse and worse with more decoders.

By splitting the decoders into smaller portions, you reduce that impact. There is likely some performance cost, but a well-executed team (like the Intel Austin one) will make balanced compromises.

The entire Atom lineup is basically a series of efforts to solve that issue:
-Bonnell with the "macro" rather than "micro-op" execution
-Goldmont with pre-decode cache
-Tremont with clustered decode
-Skymont with nanocode
They can gate off the second decoder block when it isn't needed, for instance when there are no branch instructions / no multiple instruction streams.
However, the inner details could be different, and it's unlikely they can match the throughput of a single 8-wide decoder.
It can, but not really.
Dual decoders might benefit SMT. Since Intel is removing HT, this would give AMD a nice advantage in multithreaded workloads, especially in mobile. With the increased ALU count and a bigger dispatch window, Zen 5 could see a significant improvement in SMT performance relative to Zen 4.
Maybe, but not the idea behind it.
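The throughput tradeoff of clustering can be made concrete with a toy cycle-count model: the fetch stream is split into chunks at branch boundaries, and chunks are round-robined onto two 4-wide clusters. The chunking policy and numbers here are illustrative assumptions, not Tremont's or Zen 5's actual behavior.

```python
# Toy cycle-count model of clustered decode (illustrative, not real hardware).
import math

CLUSTER_WIDTH = 4

def cycles_single_cluster(chunk_lens):
    """Cycles for one 4-wide decoder working through the chunks in order."""
    return sum(math.ceil(n / CLUSTER_WIDTH) for n in chunk_lens)

def cycles_two_clusters(chunk_lens):
    """Cycles with two clusters; finish time is the busier cluster's load."""
    load = [0, 0]
    for i, n in enumerate(chunk_lens):
        load[i % 2] += math.ceil(n / CLUSTER_WIDTH)
    return max(load)

# Branchy code (many chunks) decodes nearly twice as fast...
print(cycles_single_cluster([8, 8, 8, 8]))  # 8 cycles
print(cycles_two_clusters([8, 8, 8, 8]))    # 4 cycles
# ...but one long straight-line run gains nothing from the second cluster.
print(cycles_two_clusters([32]))            # 8 cycles
```

The straight-line case is exactly why a clustered 2x4 isn't guaranteed to match a monolithic 8-wide decoder.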
 

Nothingness

Diamond Member
Jul 3, 2013
3,063
2,041
136
Clustered decode, such as in Tremont, is an attempt to address the x86 vs ARM argument: variable-length instructions make x86 more complex to decode, requiring more transistors (thus more area and power). The relationship is said to be roughly quadratic, so the impact gets worse and worse with more decoders.

By splitting the decoders into smaller portions, you reduce that impact. There is likely some performance cost, but a well-executed team (like the Intel Austin one) will make balanced compromises.
IIRC what is quadratic is the number of possible instruction-length combinations in a fetch packet. But there are ways to get around the issue by annotating data in the I-cache lines. You pay the price of variable-length instructions once (until the line is evicted, of course).
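A minimal sketch of that annotation idea: walk an I-cache line's bytes once at fill time to record instruction-start boundaries, then let later fetches reuse the stored bits. `insn_length` here is a hypothetical stand-in for a real x86 length decoder, not any actual CPU or library interface.

```python
# Toy sketch of predecode annotation in an I-cache line (illustrative only).

def fill_line(line_bytes, insn_length):
    """Return boundary bits: marks[i] is True iff byte i starts an instruction."""
    marks = [False] * len(line_bytes)
    i = 0
    while i < len(line_bytes):
        marks[i] = True
        i += insn_length(line_bytes, i)  # length-decode cost paid once per fill
    return marks

def fetch_starts(marks):
    """A later fetch just reads the stored marks; no length decoding needed."""
    return [i for i, m in enumerate(marks) if m]

# With a pretend ISA where every instruction is 2 bytes:
print(fetch_starts(fill_line(bytes(8), lambda b, i: 2)))  # [0, 2, 4, 6]
```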
 

DavidC1

Senior member
Dec 29, 2023
829
1,304
96
IIRC what is quadratic is the number of possible instruction-length combinations in a fetch packet. But there are ways to get around the issue by annotating data in the I-cache lines. You pay the price of variable-length instructions once (until the line is evicted, of course).
I heard from an architect that the differences aren't just in the decoders, but yes, essentially what you said.

I don't believe it's something that will overcome the differences in execution between two teams. "x86 losing against ARM" just confirms biases, but doesn't take into account that both AMD and Intel have fallen flat on their faces with consistent regularity over many decades, while ARM has not.
 
Reactions: Nothingness

Tuna-Fish

Golden Member
Mar 4, 2011
1,475
1,975
136
IIRC what is quadratic is the number of possible instruction-length combinations in a fetch packet. But there are ways to get around the issue by annotating data in the I-cache lines. You pay the price of variable-length instructions once (until the line is evicted, of course).

Even after you know the lengths, you still have to pay for the massively wide mux tree that aligns the instruction starts with the decoders. (Technically, this usually happens after a first stage of decode begins on every byte boundary, but the point is, you still have to do it at some point.) This structure is huge and high-latency, and grows quadratically with decode width. (x86 instructions are 1-15 bytes long. The first instruction slot selects the first byte. The second slot selects any byte between the 2nd and the 16th. The third slot selects any byte between the 3rd and the 31st. You get where this is going.) And unless instructions are always the same width, all those transistors switch every cycle, so you pay a lot of power too.
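Plugging those numbers into a few lines of Python shows the growth. With 1-15 byte instructions, decoder slot n can start anywhere from byte n (all earlier instructions were 1 byte) to byte 1 + 15*(n-1) (all earlier instructions were 15 bytes); summing the select ranges gives the total mux fan-in.

```python
# Mux fan-in per decoder slot for 1..15 byte x86 instructions.

MAX_INSN_LEN = 15

def mux_inputs(slot):
    """Byte positions that 1-based decoder slot `slot` must select among."""
    first = slot                           # all earlier insns were 1 byte
    last = 1 + MAX_INSN_LEN * (slot - 1)   # all earlier insns were 15 bytes
    return last - first + 1

def total_mux_inputs(width):
    """Total mux fan-in across a `width`-wide decoder; grows quadratically."""
    return sum(mux_inputs(n) for n in range(1, width + 1))

print(mux_inputs(2), mux_inputs(3))  # 15 29 (the ranges from the post)
print(total_mux_inputs(4))           # 88
print(total_mux_inputs(8))           # 400 -- far more than 2 * 88 = 176
```

So a monolithic 8-wide aligner needs well over twice the fan-in of two 4-wide clusters, which is the quadratic cost the clustered approach sidesteps.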
 

randomhero

Member
Apr 28, 2020
184
251
136
Can someone jog my memory: where do we have info about execution resources?
Usually when AMD widened the core, it came with substantial "IPC" improvements. This core baffles me. We got a good but not "AMD good" IPC uptick, and what seems like a great reduction in power (laptop OEMs are all over Strix).
I do know that you can spend transistors for different gains.
They could have made a much better presentation at that keynote, though.
 

gdansk

Platinum Member
Feb 8, 2011
2,890
4,358
136
They could have made a much better presentation at that keynote, though.
Yes but it should be followed up in two forms:
1. The pre-release material given by AMD to the press before the release
2. The presentation AMD does at HotChips later this summer

Both of which are likely to include more actual information than Lisa's "AI" keynote with a few minutes of Zen.
 

StefanR5R

Elite Member
Dec 10, 2016
5,913
8,818
136
Can someone jog my memory: where do we have info about execution resources?
There are only rumors (edit: and work-in-progress code submissions to gcc).

Usually when AMD widened the core, it came with substantial "IPC" improvements. This core baffles me.
Are you implying that Zen 5 is wider than Zen 4? Careful now, because:
  • No Leaks
  • No speculation unrelated to publicly released technical materials/patches/manuals etc.
So far, the only released info related to the core width are the gcc patches.

(This thread has been opened too early.)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
There are only rumors (edit: and work-in-progress code submissions to gcc).


Are you implying that Zen 5 is wider than Zen 4? Careful now, because:

So far, the only released info related to the core width are the gcc patches.

(This thread has been opened too early.)
The GCC patches are already final and merged in GCC 14; they are not work in progress.
Perf is also updated.
Both of these indicate 6x ALUs and 4x AGUs. This is not speculation.

Also, SEV IOMMU, virtualized TSC, new PMCs, and heterogeneous cores are covered in the manuals. What is missing is the SOG and the family-specific technical references.
 

StefanR5R

Elite Member
Dec 10, 2016
5,913
8,818
136
By "work in progress" I meant that AMD left stuff out of their submission, though presumably not for technical reasons but for disclosure reasons. I already forgot what it was and where it was discussed. Was it front-end related? Did they follow up on that in the meantime?
 

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
Process node shrinks aren't going to be much of a help on the xtor/mm² front going forward either; node changes will be geared more towards power and performance improvements, IMHO.
Okay, that wasn't worded very well. I meant that shrinks will have less influence because cache won't be shrinking much, and hence will become a larger and larger percentage of the die. Obviously, the shrinks will make a big difference in core logic. I just think that perf/watt (V/f curves for xtors) will be more dominant even in desktop designs (being derivative anyway), as it appears to be with Zen 5.
 
Jul 27, 2020
19,823
13,588
146

Lower launch price, pre-orders apparently starting on July 31st, and most importantly, a rumored "Zen 5 Tech Day" in early July!
 
Reactions: Tlh97 and poke01

maddie

Diamond Member
Jul 18, 2010
4,881
4,951
136
From the interview. Interesting.

"George Cozma: You know, for a single thread of it, let’s say you’re running a workload that only uses one thread on a given core. Can a single thread take advantage of all of the front-end resources and can it take advantage of both decode clusters and the entirety of the dual ported OP cache?

Mike Clark: The answer is yes, and it’s a great question to ask because I explain SMT to a lot of people, they come in with the notion that we don’t [and] they aren’t able to use all these resources when we’re in single threaded mode, but our design philosophy is that barring a few, very rare microarchitectural exceptions, everything that matters is available in one thread mode. If we imagine we are removing [SMT] it’s not like we’d go shrink anything. There’s nothing to shrink. This is what we need for good, strong single threaded performance. And we’ve already built that."
 
Reactions: lightmanek

SarahKerrigan

Senior member
Oct 12, 2014
735
2,035
136
From the interview. Interesting.

"George Cozma: You know, for a single thread of it, let’s say you’re running a workload that only uses one thread on a given core. Can a single thread take advantage of all of the front-end resources and can it take advantage of both decode clusters and the entirety of the dual ported OP cache?

Mike Clark: The answer is yes, and it’s a great question to ask because I explain SMT to a lot of people, they come in with the notion that we don’t [and] they aren’t able to use all these resources when we’re in single threaded mode, but our design philosophy is that barring a few, very rare microarchitectural exceptions, everything that matters is available in one thread mode. If we imagine we are removing [SMT] it’s not like we’d go shrink anything. There’s nothing to shrink. This is what we need for good, strong single threaded performance. And we’ve already built that."

Yep. This is what people miss when they start reading too much into grand pronouncements about how SMT is holding back ST perf.

SMT, in its most basic form, only costs you the price of additional tagging bits on uops and structure entries.
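A minimal illustration of that "tagging bits" point: entries in a shared out-of-order structure carry a thread-ID tag, and per-thread operations (like a flush) just filter on it. All names here are made up for illustration; this is not AMD's (or anyone's) actual design.

```python
# Toy model: 2-way SMT support as one thread-ID tag per structure entry.
from dataclasses import dataclass

@dataclass
class UopEntry:
    op: str
    dest: int
    tid: int  # the "SMT tax": one extra tag bit per entry for 2-way SMT

class SharedQueue:
    """A structure shared by both threads, distinguished only by the tag."""
    def __init__(self):
        self.entries = []

    def push(self, op, dest, tid):
        self.entries.append(UopEntry(op, dest, tid))

    def flush_thread(self, tid):
        """On a per-thread pipeline flush, drop only that thread's uops."""
        self.entries = [e for e in self.entries if e.tid != tid]

q = SharedQueue()
q.push("add", 1, 0)
q.push("mul", 2, 1)
q.push("sub", 3, 0)
q.flush_thread(1)
print([e.op for e in q.entries])  # ['add', 'sub']
```

In single-thread mode the tag is simply always 0 and the entire structure is available to that one thread, which matches Clark's point that there is nothing to shrink by removing SMT.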
 