Discussion: Zen 5 Architecture & Technical discussion


BorisTheBlade82

Senior member
May 1, 2020
Although it is not on the slides, according to THG the Reorder Buffer has 448 entries.
Haven't had the time to watch the CnC video but I guess they might provide some more figures as well - or in an article later on.
 

BorisTheBlade82

Senior member
May 1, 2020
I think CnC will elaborate on that in their written article later on. But from the video it was a rather clear "yes". For full confirmation, microbenchmarks will be needed.
 

Saylick

Diamond Member
Sep 10, 2012
Some rough math (normalized to 16% IPC gain):

Component | Portion of Total IPC Gain | IPC Gain
Fetch/Branch Prediction | 12.8% | 2.05%
Decode/Opcache | 26.8% | 4.29%
Execution/Retire | 33.6% | 5.38%
Data Bandwidth | 26.8% | 4.29%
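The split above is just each component's claimed share applied to the overall 16% figure; a quick sanity check of that arithmetic (the shares and the 16% total are the thread's numbers, not measured data):

```python
# Back-of-envelope check of the table above: split a 16% overall IPC
# gain by each component's claimed share of the total gain.
TOTAL_IPC_GAIN = 0.16  # overall Zen 5 IPC uplift used in the thread

portions = {  # share of the total gain attributed to each block
    "Fetch/Branch Prediction": 0.128,
    "Decode/Opcache": 0.268,
    "Execution/Retire": 0.336,
    "Data Bandwidth": 0.268,
}

# Shares should cover the whole gain exactly once.
assert abs(sum(portions.values()) - 1.0) < 1e-9

for component, share in portions.items():
    print(f"{component}: {share * TOTAL_IPC_GAIN:.2%}")
# e.g. Fetch/Branch Prediction: 0.128 * 16% ≈ 2.05%, matching the table.
```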

 

DavidC1

Golden Member
Dec 29, 2023
Some rough math (normalized to 16% IPC gain):

Component | Portion of Total IPC Gain | IPC Gain
Fetch/Branch Prediction | 12.8% | 2.05%
Decode/Opcache | 26.8% | 4.29%
Execution/Retire | 33.6% | 5.38%
Data Bandwidth | 26.8% | 4.29%

Look at this. A whole 2% gain for fetch and branch prediction. This is how CPU design goes.

And it's why we scrutinize every way the parts are tested. Because maybe using a slow SSD instead of a fast SSD would result in "Oops! There go all the gains from the branch prediction unit!"
 

Saylick

Diamond Member
Sep 10, 2012
At 2:10 in CnC's video, Mr. Clark clearly states they still do decoding in-order. What does that mean in the context of a Tremont+ style of fetch & decode?
I interpreted that as meaning you still have fetch ---> decode, which is in-order, but there's two parallel paths now instead of one, meaning you have a dual-ported instruction fetch feeding into dual decoders. I also interpreted Mike's following statements about each decoder knowing where to start as meaning both decoders can work on the same instruction stream.
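That interpretation can be sketched as a toy model (the cluster count, the 4-wide block size, and the alternating hand-off scheme are all my assumptions for illustration, not confirmed Zen 5 behavior):

```python
# Toy sketch of a dual-ported fetch feeding two decode clusters that
# share one instruction stream. Block size and hand-off order are
# assumptions, not AMD's actual design.
from collections import deque

FETCH_BLOCK = 4  # assumed instructions handed to a cluster at a time


def decode_dual_cluster(stream):
    """Two clusters take alternating fetch blocks of the same stream;
    each decodes its own block strictly in order."""
    queue = deque(stream)
    decoded = []  # (cluster, instruction), reassembled in program order
    while queue:
        for cluster in ("cluster0", "cluster1"):
            block = [queue.popleft()
                     for _ in range(min(FETCH_BLOCK, len(queue)))]
            decoded.extend((cluster, insn) for insn in block)
    return decoded


result = decode_dual_cluster([f"i{n}" for n in range(10)])
# Program order is preserved; blocks alternate between the two clusters.
```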
 

zacharychieply

Junior Member
Apr 22, 2020
For those wondering: from the interview it sounds like Zen 5 is 8-wide decode, but each decode cluster can only decode one branch path of a single thread at a time, so real-world gains would still be around the 4-wide ILP of the previous Zen 4 offerings.
 

DavidC1

Golden Member
Dec 29, 2023
I interpreted that as meaning you still have fetch ---> decode, which is in-order, but there's two parallel paths now instead of one, meaning you have a dual-ported instruction fetch feeding into dual decoders. I also interpreted Mike's following statements about each decoder knowing where to start as meaning both decoders can work on the same instruction stream.
I again want to know why David Huang's results showed zero parallelism from the two clusters in ST. Is it disabled in mobile or does it only work situationally?

If it only works situationally, then they decided to do so to limit the increase in resources and leave the rest for future generations, when process advancements give them more room. And with SMT they could still take full advantage of it.

Going from a 1x9-throughput op cache to a 2x6-throughput op cache is also an SMT-minded decision, because for ST it would be a downgrade. The BPU can also fetch twice as much as before, fitting the dual-cluster setup.
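The 1x9 vs. 2x6 trade-off is just arithmetic on the widths quoted in the thread (not measured data):

```python
# Pure arithmetic on the op-cache throughput figures quoted above;
# uops/cycle numbers are the thread's, not measurements.
zen4_st = 9              # one 9-wide op-cache pipe serving one thread
zen5_one_cluster = 6     # one 6-wide cluster, if ST can't combine both
zen5_smt_total = 2 * 6   # both clusters busy, one per SMT thread

print(zen5_one_cluster - zen4_st)  # -3 uops/cycle: an ST downgrade
print(zen5_smt_total - zen4_st)    # +3 uops/cycle: a net win with SMT
```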
 

CouncilorIrissa

Senior member
Jul 28, 2023
I again want to know why David Huang's results showed zero parallelism from the two clusters in ST. Is it disabled in mobile or does it only work situationally?
I would exercise patience in this situation, tbh. Huang's article was based on an ES with non-final microcode, and he himself admitted to having been time-constrained. Let's wait until C&C get their hands on the thing.
 

zacharychieply

Junior Member
Apr 22, 2020
I know this is speculation, but if both decoders could work on the same code path, he wouldn't have been so vague about it in the interview.
 

itsmydamnation

Platinum Member
Feb 6, 2011
For those wondering: from the interview it sounds like Zen 5 is 8-wide decode, but each decode cluster can only decode one branch path of a single thread at a time, so real-world gains would still be around the 4-wide ILP of the previous Zen 4 offerings.
Except you have a 6K uop cache. So even if that is the case (I listened to that section three times and can't find anything as definitive as your ILP claim), it won't hold true at dispatch.
 

zacharychieply

Junior Member
Apr 22, 2020
Except you have a 6K uop cache. So even if that is the case (I listened to that section three times and can't find anything as definitive as your ILP claim), it won't hold true at dispatch.
The uop cache gets around decoding, but hits in the cache only apply to small loops, so I would argue it's not really sustained performance here.
 

itsmydamnation

Platinum Member
Feb 6, 2011
The uop cache gets around decoding, but hits in the cache only apply to small loops, so I would argue it's not really sustained performance here.
Wouldn't they be storing the uops of highly-likely-taken branches in the op cache? So over time, your decode per cycle per thread is higher. Mike said they can "look ahead" for branches.
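That "over time" argument can be put in a toy model (the decoder width, redirect penalty, and branch interval are made-up numbers, not Zen 5 figures):

```python
# Toy model: if the uops at a taken branch's target are already in the
# op cache, the front end skips the redirect stall, so average decoded
# instructions per cycle per thread is higher over time.
DECODE_WIDTH = 4      # assumed decoder width
REDIRECT_PENALTY = 2  # assumed lost cycles when a taken branch misses


def avg_decode_per_cycle(branch_every, op_cache_hit):
    """Average decode width across one taken-branch interval."""
    cycles = branch_every / DECODE_WIDTH  # cycles to decode the block
    if not op_cache_hit:
        cycles += REDIRECT_PENALTY        # stall on the redirect
    return branch_every / cycles


miss = avg_decode_per_cycle(branch_every=16, op_cache_hit=False)
hit = avg_decode_per_cycle(branch_every=16, op_cache_hit=True)
# With a hit, the average stays at the full decode width; a miss drags
# the effective width down.
```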
 

cherullo

Member
May 19, 2019
Going from a 1x9-throughput op cache to a 2x6-throughput op cache is also an SMT-minded decision, because for ST it would be a downgrade.

As far as I understand, entries in the uop cache contain up to N consecutive instructions starting from a given address, but an entry may contain fewer instructions if:
- an instruction crosses a cacheline, or
- there is a branch in the instruction stream, or
- the current entry doesn't have enough space to hold an instruction that decodes into multiple uops (probably rare, though).

So if the new uop cache can, when operating in 1T mode, retrieve two consecutive entries, or entries across branches, then it's actually a win in every case. And that should be quite doable, since the BTB already has the target address of the next branch.
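A minimal sketch of those entry-termination rules (the entry size, the field names, and the packing logic are all assumptions for illustration, not the real structure):

```python
# Hypothetical packing of consecutive instructions into one op-cache
# entry: stop at a cacheline-crossing instruction, after a branch, or
# when the entry runs out of uop slots.
N_SLOTS = 6  # assumed uop slots per op-cache entry


def build_entry(instructions):
    """Pack a prefix of `instructions` into one entry. Each instruction
    is a dict with 'uops', 'is_branch' and 'crosses_line' fields."""
    entry, used = [], 0
    for insn in instructions:
        if insn["crosses_line"]:           # can't span a cacheline
            break
        if used + insn["uops"] > N_SLOTS:  # not enough slots left
            break
        entry.append(insn)
        used += insn["uops"]
        if insn["is_branch"]:              # entry ends at a branch
            break
    return entry


def mk(uops=1, br=False, cross=False):
    return {"uops": uops, "is_branch": br, "crosses_line": cross}


# A branch terminates the entry even though slots remain free.
packed = build_entry([mk(), mk(), mk(br=True), mk()])
```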

I agree that this probably has greater impact in SMT, though. I hope someone figures out the chicken bits for these features so we can eventually compare them in different workloads.
 

BorisTheBlade82

Senior member
May 1, 2020
Sorry for double posting, but I think this information is of value for this thread as well:

In the linked article, Mike Clark indeed confirms that they have four designs for Zen 5 altogether, because they cut down AVX-512 on mobile Zen 5 and Zen 5c:

For what we’re launching today in Strix Point, both the performance core and the compact core both have the AVX cut-down [AVX-256] because they're in a heterogeneous situation, and they're in a mobile platform where area is at a premium.
 