Discussion: Zen 5 Architecture & Technical Discussion


DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136

Maybe WTFtech was not to blame? I don't know. Even the above site is saying that most of the slides are old.

Anyone spot anything new?
Indeed, they seem to have the actual HC slides. But as @Saylick mentioned, they just saw an opportunity to add filler text as usual. To be expected.
Nothing new.

However, circling back to the decode: it's pretty interesting that it matches a large part of the overall gist of these patents: 20220100519 and 20220100663.

Still, they mentioned parallel, independent instruction streams coming from different basic blocks.
Even if parallel decode cannot be done within the same basic block, then assuming average x86 code runs for 8-10 instructions per basic block, we should see this pattern play out and average > 4 inst/cycle from the decode over a period of time.
I hope someone will discover something.
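To make the arithmetic concrete, here's a quick back-of-the-envelope in C (my own sketch, assuming two independent 4-wide decode clusters and that a cluster spends ceil(len/4) cycles per basic block; the 8-10 inst/block figure is the assumption above):

Code:
/* Back-of-the-envelope decode throughput, assuming two independent
 * 4-wide decode clusters, each working on its own basic block. */
#include <stdio.h>

int main(void) {
    const int cluster_width = 4; /* assumed decode width per cluster */
    for (int block_len = 8; block_len <= 10; block_len++) {
        /* one cluster needs ceil(block_len / width) cycles per block */
        int cycles = (block_len + cluster_width - 1) / cluster_width;
        /* two clusters finish two blocks in that many cycles */
        double ipc = 2.0 * block_len / cycles;
        printf("block of %2d insts: %d cycles/block, ~%.1f inst/cycle\n",
               block_len, cycles, ipc);
    }
    return 0;
}

Even the worst case there (9 insts needing 3 cycles) stays well above 4 inst/cycle, which is why the pattern should be visible over time if the dual clusters really do work on separate blocks.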

 

Nothingness

Diamond Member
Jul 3, 2013
3,031
1,971
136
Isn't the mention of "instructions" for the OpCache new (I think they previously said micro-ops)?

At least they make it clear that the OpCache doesn't store macro-ops, contrary to Zen 4 (something I had already highlighted at the launch of Zen 5). OTOH I somewhat doubt they really store instructions (as in, plain x86 opcodes). If that were the case, it'd mean they'd have to decode and split them before they enter the UOPQ. This has to be some form of micro-op.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Still, they mentioned parallel, independent instruction streams coming from different basic blocks.
Even if parallel decode cannot be done within the same basic block, then assuming average x86 code runs for 8-10 instructions per basic block, we should see this pattern play out and average > 4 inst/cycle from the decode over a period of time.
I hope someone will discover something.
Do you happen to know the size of a basic block? Is it 32 B? I also recall some papers where, I think, decode stops if it encounters a taken branch. So an average of 1 branch per 6 instructions at 4.5 B per instruction could make the throughput lower than 8-10 instructions. [I gathered the averages with quick googling, so I'm not sure how sound they really are.]
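For what it's worth, plugging those googled averages in (a throwaway sketch, using only the figures above):

Code:
/* Rough size of an average basic block under the figures quoted above:
 * ~6 instructions per taken branch, ~4.5 bytes per x86 instruction. */
#include <stdio.h>

int main(void) {
    double insts_per_block = 6.0;  /* assumption: 1 branch per 6 insts  */
    double bytes_per_inst  = 4.5;  /* assumption: average x86 inst size */
    printf("average basic block: ~%.0f B (fetch window: 32 B)\n",
           insts_per_block * bytes_per_inst);
    return 0;
}

That gives ~27 B, i.e. the average basic block would end on a taken branch before even filling one 32 B fetch window.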
Isn't the mention of "instructions" for the OpCache new (I think they previously said micro-ops)?

At least they make it clear that the OpCache doesn't store macro-ops, contrary to Zen 4 (something I had already highlighted at the launch of Zen 5). OTOH I somewhat doubt they really store instructions (as in, plain x86 opcodes). If that were the case, it'd mean they'd have to decode and split them before they enter the UOPQ. This has to be some form of micro-op.
This is what I was asking about some time ago.
Also a question: which stage is doing the decoding now? Funnily, it seems not to be the decoders themselves, as according to the diagrams both the decoders and the op cache send instructions down to rename.
The diagrams are consistent between the original Tech Day slide deck, the Software Optimization Guide, and the HC slides. The diagrams in the Software Optimization Guide show macro-ops apparently entering the schedulers, which would suggest that macro-ops -> micro-ops happens at rename and instructions -> macro-ops happens at the UOPQ stage.

But since the Software Optimization Guide looks like it could use an errata, I'm not sure we can trust those diagrams at all...
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Do you happen to know the size of a basic block? Is it 32 B? I also recall some papers where, I think, decode stops if it encounters a taken branch. So an average of 1 branch per 6 instructions at 4.5 B per instruction could make the throughput lower than 8-10 instructions. [I gathered the averages with quick googling, so I'm not sure how sound they really are.]
A basic block basically runs from a branch target to the next branch.
Depending on the lengths of the instructions encountered, a fetch block comprises a varying number of bytes; the max fetch block is 32 B.
And a basic block can contain a number of fetch blocks.
Decode stops either when the BP cannot look beyond the next basic block and the current basic block is so small that the per-cycle decode width covers all the instructions in that single fetch block, or when it stalls because existing ops cannot be dispatched due to bottlenecks in the backend.

The Zen 5 BP can output two taken branches per cycle, so it is already designed for this parallel look-ahead execution.
The bad thing is that Zen 5 cannot use the dual decode clusters in 1T.

Even if dual decode does not work on consecutive fetch blocks from the same basic block, it should have been able to help decode a second basic block in parallel and reorder the instructions as they come out.
Zen 5's 2 taken branches/cycle are wasted here in most scenarios.

If we assume an average of 8 instructions per basic block in common x86 code, decoder 0 can decode instructions 0-7 of one basic block while, in parallel, decoder 1 decodes instructions 8-15 of the next basic block, from a single code flow involving two jumps, assuming the BP got the predictions right when looking ahead beyond the next branch. Then instructions 0-15 are reordered before going to the op queue.

Even if not as great as working on two consecutive fetch blocks, the above sounds good enough, assuming there are enough OOO structures to handle it without stalling.
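A toy model of that best case (my sketch; assumes two 4-wide clusters, perfect prediction across both taken branches, and free reordering before the op queue):

Code:
/* Toy model: cluster 0 decodes basic block N while cluster 1 decodes
 * basic block N+1; a pair of blocks finishes when the slower cluster
 * does. Perfect branch prediction across both blocks is assumed. */
#include <stdio.h>

#define WIDTH 4  /* assumed decode width per cluster */

static int cycles_for(int insts) { return (insts + WIDTH - 1) / WIDTH; }

int main(void) {
    int blocks[] = {8, 8, 8, 8, 8, 8};  /* assumed 8-inst basic blocks */
    int n = sizeof blocks / sizeof blocks[0];
    int total_insts = 0, total_cycles = 0;
    for (int i = 0; i < n; i += 2) {
        int c0 = cycles_for(blocks[i]);
        int c1 = (i + 1 < n) ? cycles_for(blocks[i + 1]) : 0;
        total_cycles += (c0 > c1) ? c0 : c1;
        total_insts  += blocks[i] + ((i + 1 < n) ? blocks[i + 1] : 0);
    }
    printf("%d insts in %d cycles = %.1f inst/cycle\n",
           total_insts, total_cycles, (double)total_insts / total_cycles);
    return 0;
}

With uniform 8-inst blocks this prints 8.0 inst/cycle; uneven block sizes drag the average down toward the slower cluster of each pair.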
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Still no deep-dive article about the Zen 5 architecture from Hot Chips?
Since they have shown little that is new, existing articles probably already cover everything. We don't know if there was a Q&A session, or whether whoever got to ask questions asked the right ones; usually at such events people ask about basic stuff, so maybe all in all there was nothing to write home about.
 
Jul 27, 2020
19,613
13,477
146
Now the next hope is GDC 2025 (Mar 17, 2025) for more architectural bean spilling when AMD tells everyone how to avoid Zen 5 pitfalls in game development.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
The Zen 5 Software Optimization Guide also contains an Excel file listing instruction latencies. In there, a Notes sheet contains the following note:
The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
There is no penalty for operations in the slow region that depend on longer latency operations or loads.
There is no penalty for any operations in the fast region.
To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
The latency test could interleave NOPs to prevent the scheduler from filling up.
This is about the 1-cycle latency regression for single-cycle SIMD ops that people were discussing here previously.
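For illustration, a sketch of such a latency test (mine, not AMD's; x86-64 GCC/Clang inline asm, needs AVX2 hardware, and the NOP-to-add ratio is my guess at what keeps the schedulers drained):

Code:
/* Dependent-chain latency test for VPADDD that interleaves NOPs so the
 * FP schedulers never fill, in the spirit of AMD's note above.
 * __rdtsc counts reference cycles, so pin the clock or convert. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define ITERS 100000000ULL

int main(void) {
    uint64_t n = ITERS;
    uint64_t t0 = __rdtsc();
    __asm__ volatile(
        "vpxor %%xmm0, %%xmm0, %%xmm0\n\t"
        "1:\n\t"
        /* four adds, each dependent on the previous: a latency chain */
        "vpaddd %%ymm0, %%ymm0, %%ymm0\n\t"
        "vpaddd %%ymm0, %%ymm0, %%ymm0\n\t"
        "vpaddd %%ymm0, %%ymm0, %%ymm0\n\t"
        "vpaddd %%ymm0, %%ymm0, %%ymm0\n\t"
        /* NOPs soak up dispatch slots so adds enter the FP scheduler
         * no faster than the chain drains them (4 NOPs/add is a guess) */
        "nop\n\tnop\n\tnop\n\tnop\n\t"
        "nop\n\tnop\n\tnop\n\tnop\n\t"
        "nop\n\tnop\n\tnop\n\tnop\n\t"
        "nop\n\tnop\n\tnop\n\tnop\n\t"
        "dec %0\n\t"
        "jnz 1b\n\t"
        : "+r"(n) : : "xmm0", "cc");
    uint64_t t1 = __rdtsc();
    printf("approx cycles per dependent vpaddd: %.2f\n",
           (double)(t1 - t0) / (double)(ITERS * 4));
    return 0;
}

If the schedulers stay out of the slow region this should report roughly 1.0; drop the NOPs and a plain chained test could report closer to 2.0, per the note.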
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,966
136
Would such a latency test be benign for other architectures?

It's very hard to design microbenchmarks that are "fair" across architectures; the default answer is no.

That's why you should always test performance on real loads instead; microbenchmarks are for studying what is going on at the low level.

But this is an interesting case. I would assume most normal FP code does not see the schedulers full; there would be prediction failures often enough for that. The slow region presumably allows the frontend to run further ahead in FP-heavy cases, increasing performance when this lets loads happen earlier, but possibly actually decreasing performance for extremely FP-heavy code that is well predicted and fits mostly in the L1 cache.
 

naukkis

Senior member
Jun 5, 2002
877
747
136
The slow region presumably allows the frontend to run further ahead in FP-heavy cases, increasing performance when this lets loads happen earlier, but possibly actually decreasing performance for extremely FP-heavy code that is well predicted and fits mostly in the L1 cache.
Or it's just that a near-full scheduler has to assign some instructions to the wrong FP register file, needing cross-register-file moves, which take that additional cycle.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,966
136
Or it's just that a near-full scheduler has to assign some instructions to the wrong FP register file, needing cross-register-file moves, which take that additional cycle.

On some loads the +1 latency seems universal; if the issue were having to pick the wrong reg file out of a split implementation, I'd expect at least some to hit the fast path.

I don't have a clue what is going on.
 
Jul 27, 2020
19,613
13,477
146
I don't have a clue what is going on.
They ran into some issue (security? latency-related? thermal-related? could be anything), so they blocked some pathways and fell back to the slower paths. They do design redundancy into execution pathways, no? Just in case some of their moonshot ideas don't work out.

They have fallen into that trap before, with RDNA3 and its dual-issue thingy.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Would such a latency test be benign for other architectures?
No. You should read it as: "We are claiming some instructions have a best-case latency of 1 cycle, but your standard latency test might measure 2 cycles. To measure 1 cycle, do this." That is the only purpose of the testing solution they propose. It's not meant for performance comparisons between architectures, but rather for building an intuition about how this particular uarch behaves.

The important point is that mixed code that doesn't use SIMD much will not see this latency. Tuned SIMD-heavy code will, but keep in mind that basic operations like floating-point add, floating-point multiply, or FMA are not affected, as they were already above 1-cycle latency. For SIMD INT, adds might be affected as they are 1 cycle in the best case, and among general ops some shuffles might be affected, but again this only becomes a problem with dependency chains, as otherwise pipelining will hide it.
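The dependency-chain point in code form (a sketch with AVX2 intrinsics; compile with -mavx2, the function names are mine):

Code:
/* chain(): every add waits on the previous one, so any +1 cycle of
 * VPADDD latency shows up 1:1 in runtime.
 * parallel(): four independent accumulators keep the pipes busy, so
 * per-op latency is hidden and only throughput matters. */
#include <immintrin.h>

__m256i chain(__m256i v, int n) {
    __m256i acc = v;
    for (int i = 0; i < n; i++)
        acc = _mm256_add_epi32(acc, v);   /* serial: latency-bound */
    return acc;
}

__m256i parallel(__m256i v, int n) {
    __m256i a0 = v, a1 = v, a2 = v, a3 = v;
    for (int i = 0; i < n; i += 4) {      /* independent: throughput-bound */
        a0 = _mm256_add_epi32(a0, v);
        a1 = _mm256_add_epi32(a1, v);
        a2 = _mm256_add_epi32(a2, v);
        a3 = _mm256_add_epi32(a3, v);
    }
    return _mm256_add_epi32(_mm256_add_epi32(a0, a1),
                            _mm256_add_epi32(a2, a3));
}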

On some loads the +1 latency seems universal; if the issue were having to pick the wrong reg file out of a split implementation, I'd expect at least some to hit the fast path.

I don't have a clue what is going on.
The +1 cycle latency on loads comes from addressing modes [complex addressing adds 1 cycle] and is orthogonal to the issue at hand. Since loads are by default > 1 cycle latency, they are not affected by the schedulers filling up. In addition, they specifically mention that if your instruction depends on loads or longer-latency instructions, the penalty doesn't apply.
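For reference, the two addressing flavors look like this in a pointer-chasing test (my illustrative sketch; x86-64 GCC/Clang inline asm, reference cycles via __rdtsc):

Code:
/* Simple (base-only) vs. complex (base+index*scale) addressing on a
 * serially dependent load chain; the complex form is the one the guide
 * documents as costing an extra cycle of load-to-use latency. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void) {
    uint64_t cell = 0;
    cell = (uint64_t)&cell;          /* self-pointing cell to chase    */
    uint64_t v = (uint64_t)&cell, idx = 0, t0, t1;
    enum { N = 100000000 };

    t0 = __rdtsc();
    for (long i = 0; i < N; i++)     /* simple: base only (fast case)  */
        __asm__ volatile("movq (%0), %0" : "+r"(v));
    t1 = __rdtsc();
    printf("simple : %.2f cycles/load\n", (double)(t1 - t0) / N);

    t0 = __rdtsc();
    for (long i = 0; i < N; i++)     /* complex: base + index*8        */
        __asm__ volatile("movq (%0,%1,8), %0" : "+r"(v) : "r"(idx));
    t1 = __rdtsc();
    printf("complex: %.2f cycles/load\n", (double)(t1 - t0) / N);
    return 0;
}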
Or it's just that a near-full scheduler has to assign some instructions to the wrong FP register file, needing cross-register-file moves, which take that additional cycle.
All FP schedulers are using the same register file.

They have fallen into that trap before, with RDNA3 and its dual-issue thingy.
I wouldn't compare the two. With RDNA3 the problem is that it has to be explicitly coded for, so a lot of the chip's potential goes unused.
 

naukkis

Senior member
Jun 5, 2002
877
747
136
All FP schedulers are using the same register file.
What's the technical reason for that added latency between operations? The Alpha 21264 behaves just like that: the register file and execution units are split in half, and when dependent instructions have to be sent to both halves because the scheduler is full (the high-IPC case), there is an additional one-cycle latency for register-file syncing between operations. At least, I don't know of other such latency differences in the execution phase.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
What's the technical reason for that added latency between operations? The Alpha 21264 behaves just like that: the register file and execution units are split in half, and when dependent instructions have to be sent to both halves because the scheduler is full (the high-IPC case), there is an additional one-cycle latency for register-file syncing between operations. At least, I don't know of other such latency differences in the execution phase.
We would need an AMD engineer to answer that question. To the best of my knowledge, there is one register file for each domain. We don't know if it is internally divided. What we do know is that the latency penalty applies only to a subset of instructions, and only under specific conditions, while what you describe for the Alpha seems to apply whenever the scheduler is full, regardless of what instructions are being executed.
 

naukkis

Senior member
Jun 5, 2002
877
747
136
We would need an AMD engineer to answer that question. To the best of my knowledge, there is one register file for each domain. We don't know if it is internally divided. What we do know is that the latency penalty applies only to a subset of instructions, and only under specific conditions, while what you describe for the Alpha seems to apply whenever the scheduler is full, regardless of what instructions are being executed.
The Alpha has that additional latency when an instruction has to cross register files. The scheduler (in the Alpha's case, with help from the programmer) tries to keep dependent instructions on the same side of the register file. But in situations where, say, 3 adds need to be scheduled in one clock on a 2+2 ALU configuration, one instruction has to take the wrong side, and as a result its dependent instructions see a one-cycle latency penalty. When the scheduler can isolate dependent instructions to their own sides, full throughput can be maintained without any penalties. A trivial rendering of that forwarding rule is below.
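Code:
/* Toy illustration of clustered-register-file forwarding, in the
 * spirit of the 21264 description above: same-cluster forwarding is
 * 1 cycle, cross-cluster forwarding costs an extra cycle. The numbers
 * are illustrative placeholders, not measurements. */
#include <stdio.h>

int main(void) {
    int producer = 0;  /* cluster that executes the producing add */
    for (int consumer = 0; consumer < 2; consumer++) {
        int latency = 1 + (consumer != producer);  /* +1 if crossing */
        printf("producer on cluster %d -> consumer on cluster %d: "
               "%d cycle(s)\n", producer, consumer, latency);
    }
    return 0;
}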
 