Discussion Zen 5 Architecture & Technical discussion


FlameTail

Diamond Member
Dec 15, 2021
3,852
2,295
106
Zen 5 was supposed to focus on efficiency as per FAD2022. Quite odd there was no mention of it.

Dual decode seems interesting, so Z5 would now be 8 wide (2x 4 wide), Tremont style.
That is very interesting indeed. Why do split decoders? What's the benefit? Why not a monolithic 8-wide decoder?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
That is very interesting indeed. Why do split decoders? What's the benefit? Why not a monolithic 8-wide decoder?
They can gate off the second decoder block when it isn't needed, for instance when there are no branch instructions / no multiple instruction streams.
However, the inner details could be different, and it's unlikely they can match the throughput of a single 8-wide decoder.
 

JustViewing

Senior member
Aug 17, 2022
217
383
106
That is very interesting indeed. Why do split decoders? What's the benefit? Why not a monolithic 8-wide decoder?
Dual decoders might benefit SMT. Since Intel is removing HT, this would give AMD a nice advantage in multithreaded workloads, especially in mobile. With the increased ALU count and a bigger dispatch window, Zen 5 could see a significant improvement in SMT performance relative to Zen 4.
 

StefanR5R

Elite Member
Dec 10, 2016
5,913
8,818
136
On post #1: FAD2022 material could have been included, as thin as the Computex material was.

On post #2: Caution! Post #1 forbids speculation here.
 
Jul 27, 2020
19,823
13,588
146
I got the sense that they didn't want Zen 5 getting too much attention because the star of the show was supposed to be AI. Will have to wait for reviews/CnC deep dive/HotChips for the juicy details.
 

DavidC1

Senior member
Dec 29, 2023
829
1,304
96
That is very interesting indeed. Why do split decoders? What's the benefit? Why not a monolithic 8-wide decoder?
Clustered decode, such as in Tremont, is an attempt to address the x86 vs ARM argument: variable-length instructions make x86 more complex to decode, requiring more transistors (thus more area and power). The relationship is said to be roughly quadratic, so the impact gets worse and worse with more decoders.

By splitting the decoders into smaller portions, you reduce that impact. There is likely some performance cost, but a well-executed team (like the Intel Austin one) will make balanced compromises.

The entire Atom lineup is basically a series of efforts to solve that issue:
-Bonnell with the "macro" rather than "micro-op" execution
-Goldmont with pre-decode cache
-Tremont with clustered decode
-Skymont with nanocode
They can gate off the second decoder block when it isn't needed, for instance when there are no branch instructions / no multiple instruction streams.
However, the inner details could be different, and it's unlikely they can match the throughput of a single 8-wide decoder.
It can, but not really.
Dual decoders might benefit SMT. Since Intel is removing HT, this would give AMD a nice advantage in multithreaded workloads, especially in mobile. With the increased ALU count and a bigger dispatch window, Zen 5 could see a significant improvement in SMT performance relative to Zen 4.
Maybe, but not the idea behind it.
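The throughput tradeoff of clustering can be made concrete with a toy cycle-count model: the fetch stream is split into chunks at branch boundaries, and chunks are round-robined onto two 4-wide clusters. The chunking policy and numbers here are illustrative assumptions, not Tremont's or Zen 5's actual behavior.

```python
# Toy cycle-count model of clustered decode (illustrative, not real hardware).
import math

CLUSTER_WIDTH = 4

def cycles_single_cluster(chunk_lens):
    """Cycles for one 4-wide decoder working through the chunks in order."""
    return sum(math.ceil(n / CLUSTER_WIDTH) for n in chunk_lens)

def cycles_two_clusters(chunk_lens):
    """Cycles with two clusters; finish time is the busier cluster's load."""
    load = [0, 0]
    for i, n in enumerate(chunk_lens):
        load[i % 2] += math.ceil(n / CLUSTER_WIDTH)
    return max(load)

# Branchy code (many chunks) decodes nearly twice as fast...
print(cycles_single_cluster([8, 8, 8, 8]))  # 8 cycles
print(cycles_two_clusters([8, 8, 8, 8]))    # 4 cycles
# ...but one long straight-line run gains nothing from the second cluster.
print(cycles_two_clusters([32]))            # 8 cycles
```

The straight-line case is exactly why a clustered 2x4 isn't guaranteed to match a monolithic 8-wide decoder.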
 

Nothingness

Diamond Member
Jul 3, 2013
3,063
2,041
136
Clustered decode, such as in Tremont, is an attempt to address the x86 vs ARM argument: variable-length instructions make x86 more complex to decode, requiring more transistors (thus more area and power). The relationship is said to be roughly quadratic, so the impact gets worse and worse with more decoders.

By splitting the decoders into smaller portions, you reduce that impact. There is likely some performance cost, but a well-executed team (like the Intel Austin one) will make balanced compromises.
IIRC what is quadratic is the number of possible instruction-length combinations in a fetch packet. But there are ways to get around the issue by annotating data in the I-cache lines. You pay the price of variable-length instructions once (until the line is evicted, of course).
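A minimal sketch of that annotation idea: walk an I-cache line's bytes once at fill time to record instruction-start boundaries, then let later fetches reuse the stored bits. `insn_length` here is a hypothetical stand-in for a real x86 length decoder, not any actual CPU or library interface.

```python
# Toy sketch of predecode annotation in an I-cache line (illustrative only).

def fill_line(line_bytes, insn_length):
    """Return boundary bits: marks[i] is True iff byte i starts an instruction."""
    marks = [False] * len(line_bytes)
    i = 0
    while i < len(line_bytes):
        marks[i] = True
        i += insn_length(line_bytes, i)  # length-decode cost paid once per fill
    return marks

def fetch_starts(marks):
    """A later fetch just reads the stored marks; no length decoding needed."""
    return [i for i, m in enumerate(marks) if m]

# With a pretend ISA where every instruction is 2 bytes:
print(fetch_starts(fill_line(bytes(8), lambda b, i: 2)))  # [0, 2, 4, 6]
```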
 

DavidC1

Senior member
Dec 29, 2023
829
1,304
96
IIRC what is quadratic is the number of possible instruction-length combinations in a fetch packet. But there are ways to get around the issue by annotating data in the I-cache lines. You pay the price of variable-length instructions once (until the line is evicted, of course).
I heard from an architect that the differences aren't just in the decoders, but yes, essentially what you said.

I don't believe it's something that will overcome the differences in execution between two teams. "x86 losing against ARM" just confirms biases, but doesn't take into account that both AMD and Intel have fallen flat on their faces with consistent regularity over many decades, while ARM has not.
 
Reactions: Nothingness

Tuna-Fish

Golden Member
Mar 4, 2011
1,475
1,975
136
IIRC what is quadratic is the number of possible instruction-length combinations in a fetch packet. But there are ways to get around the issue by annotating data in the I-cache lines. You pay the price of variable-length instructions once (until the line is evicted, of course).

Even after you know the lengths, you still have to pay for the massively wide mux tree that aligns the instruction starts with the decoders. (Technically, this usually happens after a first stage of decode begins on every byte boundary, but the point is, you still have to do it at some point.) This structure is huge and high-latency, and grows quadratically with decode width. (x86 instructions are 1-15 bytes long. The first instruction slot selects the first byte. The second slot selects any byte between the 2nd and the 16th. The third slot selects any byte between the 3rd and the 31st. You get where this is going.) And unless instructions are always the same width, all those transistors switch every cycle, so you pay a lot of power too.
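Plugging those numbers into a few lines of Python shows the growth. With 1-15 byte instructions, decoder slot n can start anywhere from byte n (all earlier instructions were 1 byte) to byte 1 + 15*(n-1) (all earlier instructions were 15 bytes); summing the select ranges gives the total mux fan-in.

```python
# Mux fan-in per decoder slot for 1..15 byte x86 instructions.

MAX_INSN_LEN = 15

def mux_inputs(slot):
    """Byte positions that 1-based decoder slot `slot` must select among."""
    first = slot                           # all earlier insns were 1 byte
    last = 1 + MAX_INSN_LEN * (slot - 1)   # all earlier insns were 15 bytes
    return last - first + 1

def total_mux_inputs(width):
    """Total mux fan-in across a `width`-wide decoder; grows quadratically."""
    return sum(mux_inputs(n) for n in range(1, width + 1))

print(mux_inputs(2), mux_inputs(3))  # 15 29 (the ranges from the post)
print(total_mux_inputs(4))           # 88
print(total_mux_inputs(8))           # 400 -- far more than 2 * 88 = 176
```

So a monolithic 8-wide aligner needs well over twice the fan-in of two 4-wide clusters, which is the quadratic cost the clustered approach sidesteps.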
 

randomhero

Member
Apr 28, 2020
184
251
136
Can someone jog my memory: where do we have info about execution resources?
Usually when AMD widened the core, it came with substantial "IPC" improvements. This core baffles me. We got a good but not "AMD good" IPC uptick, and what seems like a great reduction in power (laptop OEMs are all over Strix).
I do know that you can spend transistors for different gains.
They could have made a much better presentation at that keynote, though.
 

gdansk

Platinum Member
Feb 8, 2011
2,890
4,358
136
They could have made a much better presentation at that keynote, though.
Yes but it should be followed up in two forms:
1. The pre-release material given by AMD to the press before the release
2. The presentation AMD does at HotChips later this summer

Both of which are likely to include more actual information than Lisa's "AI" keynote with a few minutes of Zen.
 

StefanR5R

Elite Member
Dec 10, 2016
5,913
8,818
136
Can someone jog my memory: where do we have info about execution resources?
There are only rumors (edit: and work-in-progress code submissions to gcc).

Usually when AMD widened the core, it came with substantial "IPC" improvements. This core baffles me.
Are you implying that Zen 5 is wider than Zen 4? Careful now, because:
  • No Leaks
  • No speculation unrelated to publicly released technical materials/patches/manuals etc.
So far, the only released info related to the core width are the gcc patches.

(This thread has been opened too early.)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
There are only rumors (edit: and work-in-progress code submissions to gcc).


Are you implying that Zen 5 is wider than Zen 4? Careful now, because:

So far, the only released info related to the core width are the gcc patches.

(This thread has been opened too early.)
The GCC patches are already final and merged in GCC 14; they are not work in progress.
Perf is also updated.
Both of these indicate 6x ALUs and 4x AGUs. This is not speculation.

Also, SEV IOMMU, virtualized TSC, new PMCs, and heterogeneous cores are covered in the manuals. What is missing is the SOG and the family-specific technical references.
 

StefanR5R

Elite Member
Dec 10, 2016
5,913
8,818
136
By "work in progress" I meant that AMD left stuff out of their submission, though presumably not for technical reasons but for disclosure reasons. I already forgot what it was and where it was discussed. Was it front-end related? Did they follow up on that in the meantime?
 

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
Process node shrinks aren't going to be much of a help on the xtor/mm² front going forward either; node changes will be geared more towards power and performance improvements, IMHO.
Okay, that wasn't worded very well. I meant that shrinks will have less influence because cache won't be shrinking much, and hence will become a larger and larger percentage of the die. Obviously, the shrinks will make a big difference in core logic. I just think that perf/watt (V/f curves for xtors) will be more dominant even in desktop designs (being derivative anyway), as it appears to be with Zen 5.
 
Jul 27, 2020
19,823
13,588
146

Lower launch price, pre-orders apparently starting on July 31st, and most importantly, a rumored "Zen 5 Tech Day" in early July!
 
Reactions: Tlh97 and poke01

maddie

Diamond Member
Jul 18, 2010
4,881
4,951
136
From the interview. Interesting.

"George Cozma: You know, for a single thread of it, let’s say you’re running a workload that only uses one thread on a given core. Can a single thread take advantage of all of the front-end resources and can it take advantage of both decode clusters and the entirety of the dual ported OP cache?

Mike Clark: The answer is yes, and it’s a great question to ask because I explain SMT to a lot of people, they come in with the notion that we don’t [and] they aren’t able to use all these resources when we’re in single threaded mode, but our design philosophy is that barring a few, very rare microarchitectural exceptions, everything that matters is available in one thread mode. If we imagine we are removing [SMT] it’s not like we’d go shrink anything. There’s nothing to shrink. This is what we need for good, strong single threaded performance. And we’ve already built that."
 
Reactions: lightmanek

SarahKerrigan

Senior member
Oct 12, 2014
735
2,035
136
From the interview. Interesting.

"George Cozma: You know, for a single thread of it, let’s say you’re running a workload that only uses one thread on a given core. Can a single thread take advantage of all of the front-end resources and can it take advantage of both decode clusters and the entirety of the dual ported OP cache?

Mike Clark: The answer is yes, and it’s a great question to ask because I explain SMT to a lot of people, they come in with the notion that we don’t [and] they aren’t able to use all these resources when we’re in single threaded mode, but our design philosophy is that barring a few, very rare microarchitectural exceptions, everything that matters is available in one thread mode. If we imagine we are removing [SMT] it’s not like we’d go shrink anything. There’s nothing to shrink. This is what we need for good, strong single threaded performance. And we’ve already built that."

Yep. This is what people miss when they start reading too much into grand pronouncements about how SMT is holding back ST perf.

SMT, in its most basic form, only costs you the price of additional tagging bits on uops and structure entries.
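A minimal illustration of that "tagging bits" point: entries in a shared out-of-order structure carry a thread-ID tag, and per-thread operations (like a flush) just filter on it. All names here are made up for illustration; this is not AMD's (or anyone's) actual design.

```python
# Toy model: 2-way SMT support as one thread-ID tag per structure entry.
from dataclasses import dataclass

@dataclass
class UopEntry:
    op: str
    dest: int
    tid: int  # the "SMT tax": one extra tag bit per entry for 2-way SMT

class SharedQueue:
    """A structure shared by both threads, distinguished only by the tag."""
    def __init__(self):
        self.entries = []

    def push(self, op, dest, tid):
        self.entries.append(UopEntry(op, dest, tid))

    def flush_thread(self, tid):
        """On a per-thread pipeline flush, drop only that thread's uops."""
        self.entries = [e for e in self.entries if e.tid != tid]

q = SharedQueue()
q.push("add", 1, 0)
q.push("mul", 2, 1)
q.push("sub", 3, 0)
q.flush_thread(1)
print([e.op for e in q.entries])  # ['add', 'sub']
```

In single-thread mode the tag is simply always 0 and the entire structure is available to that one thread, which matches Clark's point that there is nothing to shrink by removing SMT.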
 