Discussion: Zen 5 Architecture & Technical Discussion


DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136

Maybe WTFtech was not to blame? I don't know. Even the above site is saying that most of the slides are old.

Anyone spot anything new?
Indeed, they seem to have the actual HC slides. But as @Saylick mentioned, they just saw an opportunity to add filler text as usual. To be expected.
Nothing new.

However, circling back to the decode: it's pretty interesting that it matches a large part of the overall gist of these patents: 20220100519 and 20220100663.

Still, they mentioned parallel, independent instruction streams coming from different basic blocks.
Even if parallel decode cannot be done within the same basic block, then assuming average x86 code runs for 8-10 instructions per basic block, we should see this pattern play out and average > 4 inst/cycle from the decode over a period of time.
I hope someone will discover something.
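To make the arithmetic concrete, here's a quick back-of-the-envelope in C (my own sketch, assuming two independent 4-wide decode clusters and that a cluster spends ceil(len/4) cycles per basic block; the 8-10 inst/block figure is the assumption above):

Code:
/* Back-of-the-envelope decode throughput, assuming two independent
 * 4-wide decode clusters, each working on its own basic block. */
#include <stdio.h>

int main(void) {
    const int cluster_width = 4; /* assumed decode width per cluster */
    for (int block_len = 8; block_len <= 10; block_len++) {
        /* one cluster needs ceil(block_len / width) cycles per block */
        int cycles = (block_len + cluster_width - 1) / cluster_width;
        /* two clusters finish two blocks in that many cycles */
        double ipc = 2.0 * block_len / cycles;
        printf("block of %2d insts: %d cycles/block, ~%.1f inst/cycle\n",
               block_len, cycles, ipc);
    }
    return 0;
}

Even the worst case there (9 insts needing 3 cycles) stays well above 4 inst/cycle, which is why the pattern should be visible over time if the dual clusters really do work on separate blocks.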

 

Nothingness

Diamond Member
Jul 3, 2013
3,031
1,971
136
Isn't the mention of "instructions" for the OpCache new (I think they previously said micro-ops)?

At least they make it clear that the OpCache doesn't store macro-ops, contrary to Zen 4 (something I had already highlighted at the launch of Zen 5). OTOH I somewhat doubt they really store instructions (as in, plain x86 opcodes). If that were the case, it'd mean they'd have to decode and split them before they enter the UOPQ. This has to be some form of micro-op.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Still, they mentioned parallel, independent instruction streams coming from different basic blocks.
Even if parallel decode cannot be done within the same basic block, then assuming average x86 code runs for 8-10 instructions per basic block, we should see this pattern play out and average > 4 inst/cycle from the decode over a period of time.
I hope someone will discover something.
Do you happen to know the size of a basic block? Is it 32 B? I also recall some papers where, I think, decode stops if it encounters a taken branch. So an average of 1 branch per 6 instructions at 4.5 B per instruction could make the throughput lower than 8-10 instructions. [I gathered the averages with quick googling, so I'm not sure how sound they really are.]
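For what it's worth, plugging those googled averages in (a throwaway sketch, using only the figures above):

Code:
/* Rough size of an average basic block under the figures quoted above:
 * ~6 instructions per taken branch, ~4.5 bytes per x86 instruction. */
#include <stdio.h>

int main(void) {
    double insts_per_block = 6.0;  /* assumption: 1 branch per 6 insts  */
    double bytes_per_inst  = 4.5;  /* assumption: average x86 inst size */
    printf("average basic block: ~%.0f B (fetch window: 32 B)\n",
           insts_per_block * bytes_per_inst);
    return 0;
}

That gives ~27 B, i.e. the average basic block would end on a taken branch before even filling one 32 B fetch window.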
Isn't the mention of "instructions" for the OpCache new (I think they previously said micro-ops)?

At least they make it clear that the OpCache doesn't store macro-ops, contrary to Zen 4 (something I had already highlighted at the launch of Zen 5). OTOH I somewhat doubt they really store instructions (as in, plain x86 opcodes). If that were the case, it'd mean they'd have to decode and split them before they enter the UOPQ. This has to be some form of micro-op.
This is what I was asking about some time ago.
Also a question: which stage is doing the decoding now? Funnily, it seems not to be the decoders themselves, as according to the diagrams both the decoders and the op cache send instructions down to rename.
The diagrams are consistent between the original Tech Day slide deck, the Software Optimization Guide, and the HC slides. The diagrams in the Software Optimization Guide show macro-ops apparently entering the schedulers, which would suggest that macro-ops -> micro-ops happens at rename and instructions -> macro-ops happens at the UOPQ stage.

But since the Software Optimization Guide looks like it could use an errata, I'm not sure we can trust those diagrams at all...
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Do you happen to know the size of a basic block? Is it 32 B? I also recall some papers where, I think, decode stops if it encounters a taken branch. So an average of 1 branch per 6 instructions at 4.5 B per instruction could make the throughput lower than 8-10 instructions. [I gathered the averages with quick googling, so I'm not sure how sound they really are.]
A basic block basically runs from a branch target to the next branch.
Depending on the lengths of the instructions encountered, a fetch block comprises a varying number of bytes; the max fetch block is 32 B.
And a basic block can contain a number of fetch blocks.
Decode stops either when the BP cannot look beyond the next basic block and the current basic block is so small that the per-cycle decode width covers all the instructions in that single fetch block, or when it stalls because existing ops cannot be dispatched due to bottlenecks in the backend.

The Zen 5 BP can output two taken branches per cycle, so it is already designed for this parallel look-ahead execution.
The bad thing is that Zen 5 cannot use the dual decode clusters in 1T.

Even if dual decode does not work on consecutive fetch blocks from the same basic block, it should have been able to help decode a second basic block in parallel and reorder the instructions as they come out.
Zen 5's 2 taken branches/cycle are wasted here in most scenarios.

If we assume an average of 8 instructions per basic block in common x86 code, decoder 0 can decode instructions 0-7 of one basic block while, in parallel, decoder 1 decodes instructions 8-15 of the next basic block, from a single code flow involving two jumps, assuming the BP got the predictions right when looking ahead beyond the next branch. Then instructions 0-15 are reordered before going to the op queue.

Even if not as great as working on two consecutive fetch blocks, the above sounds good enough, assuming there are enough OOO structures to handle it without stalling.
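A toy model of that best case (my sketch; assumes two 4-wide clusters, perfect prediction across both taken branches, and free reordering before the op queue):

Code:
/* Toy model: cluster 0 decodes basic block N while cluster 1 decodes
 * basic block N+1; a pair of blocks finishes when the slower cluster
 * does. Perfect branch prediction across both blocks is assumed. */
#include <stdio.h>

#define WIDTH 4  /* assumed decode width per cluster */

static int cycles_for(int insts) { return (insts + WIDTH - 1) / WIDTH; }

int main(void) {
    int blocks[] = {8, 8, 8, 8, 8, 8};  /* assumed 8-inst basic blocks */
    int n = sizeof blocks / sizeof blocks[0];
    int total_insts = 0, total_cycles = 0;
    for (int i = 0; i < n; i += 2) {
        int c0 = cycles_for(blocks[i]);
        int c1 = (i + 1 < n) ? cycles_for(blocks[i + 1]) : 0;
        total_cycles += (c0 > c1) ? c0 : c1;
        total_insts  += blocks[i] + ((i + 1 < n) ? blocks[i + 1] : 0);
    }
    printf("%d insts in %d cycles = %.1f inst/cycle\n",
           total_insts, total_cycles, (double)total_insts / total_cycles);
    return 0;
}

With uniform 8-inst blocks this prints 8.0 inst/cycle; uneven block sizes drag the average down toward the slower cluster of each pair.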
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Still no deep-dive article about the Zen 5 architecture from Hot Chips?
Since they have shown little that is new, existing articles probably already cover everything. We don't know if there was a Q&A session, or whether whoever got to ask questions asked the right ones; usually at such events people ask about basic stuff, so maybe all in all there was nothing to write home about.
 
Jul 27, 2020
19,613
13,477
146
Now the next hope is GDC 2025 (Mar 17, 2025) for more architectural bean spilling when AMD tells everyone how to avoid Zen 5 pitfalls in game development.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
The Zen 5 Software Optimization Guide also contains an Excel file listing instruction latencies. In there, a Notes sheet contains the following note:
The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
There is no penalty for operations in the slow region that depend on longer latency operations or loads.
There is no penalty for any operations in the fast region.
To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
The latency test could interleave NOPs to prevent the scheduler from filling up.
This is about the 1-cycle latency regression for single-cycle SIMD ops that people were discussing here previously.
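For illustration, a sketch of such a latency test (mine, not AMD's; x86-64 GCC/Clang inline asm, needs AVX2 hardware, and the NOP-to-add ratio is my guess at what keeps the schedulers drained):

Code:
/* Dependent-chain latency test for VPADDD that interleaves NOPs so the
 * FP schedulers never fill, in the spirit of AMD's note above.
 * __rdtsc counts reference cycles, so pin the clock or convert. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define ITERS 100000000ULL

int main(void) {
    uint64_t n = ITERS;
    uint64_t t0 = __rdtsc();
    __asm__ volatile(
        "vpxor %%xmm0, %%xmm0, %%xmm0\n\t"
        "1:\n\t"
        /* four adds, each dependent on the previous: a latency chain */
        "vpaddd %%ymm0, %%ymm0, %%ymm0\n\t"
        "vpaddd %%ymm0, %%ymm0, %%ymm0\n\t"
        "vpaddd %%ymm0, %%ymm0, %%ymm0\n\t"
        "vpaddd %%ymm0, %%ymm0, %%ymm0\n\t"
        /* NOPs soak up dispatch slots so adds enter the FP scheduler
         * no faster than the chain drains them (4 NOPs/add is a guess) */
        "nop\n\tnop\n\tnop\n\tnop\n\t"
        "nop\n\tnop\n\tnop\n\tnop\n\t"
        "nop\n\tnop\n\tnop\n\tnop\n\t"
        "nop\n\tnop\n\tnop\n\tnop\n\t"
        "dec %0\n\t"
        "jnz 1b\n\t"
        : "+r"(n) : : "xmm0", "cc");
    uint64_t t1 = __rdtsc();
    printf("approx cycles per dependent vpaddd: %.2f\n",
           (double)(t1 - t0) / (double)(ITERS * 4));
    return 0;
}

If the schedulers stay out of the slow region this should report roughly 1.0; drop the NOPs and a plain chained test could report closer to 2.0, per the note.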
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,966
136
Would such a latency test be benign for other architectures?

It's very hard to design microbenchmarks that are "fair" across architectures; the default answer is no.

That's why you should always test performance on real loads instead; microbenchmarks are for studying what is going on at the low level.

But this is an interesting case. I would assume most normal FP code does not see the schedulers full; there would be prediction failures often enough for that. The slow region presumably allows the frontend to run further ahead in FP-heavy cases, increasing performance when this lets loads happen earlier, but possibly actually decreasing performance for extremely FP-heavy code that is well predicted and fits mostly in the L1 cache.
 

naukkis

Senior member
Jun 5, 2002
877
747
136
The slow region presumably allows the frontend to run further ahead in FP-heavy cases, increasing performance when this lets loads happen earlier, but possibly actually decreasing performance for extremely FP-heavy code that is well predicted and fits mostly in the L1 cache.
Or it's just that a near-full scheduler has to assign some instructions to the wrong FP register file, needing cross-register-file moves, which take that additional cycle.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,966
136
Or it's just that a near-full scheduler has to assign some instructions to the wrong FP register file, needing cross-register-file moves, which take that additional cycle.

On some loads the +1 latency seems universal; if the issue were having to pick the wrong reg file out of a split implementation, I'd expect at least some to hit the fast path.

I don't have a clue what is going on.
 
Jul 27, 2020
19,613
13,477
146
I don't have a clue what is going on.
They ran into some issue (security? latency-related? thermal-related? could be anything), so they blocked some pathways and fell back to the slower paths. They do design redundancy into execution pathways, no? Just in case some of their moonshot ideas don't work out.

They have fallen into that trap before, with RDNA3 and its dual-issue thingy.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Would such a latency test be benign for other architectures?
No. You should read it as: "We are claiming some instructions have a best-case latency of 1 cycle, but your standard latency test might measure 2 cycles. To measure 1 cycle, do this." That is the only purpose of the testing solution they propose. It's not meant for performance comparisons between architectures, but rather for building an intuition about how this particular uarch behaves.

The important point is that mixed code that doesn't use SIMD much will not see this latency. Tuned SIMD-heavy code will, but keep in mind that basic operations like floating-point add, floating-point multiply, or FMA are not affected, as they were already above 1-cycle latency. For SIMD INT, adds might be affected as they are 1 cycle in the best case, and among general ops some shuffles might be affected, but again this only becomes a problem with dependency chains, as otherwise pipelining will hide it.
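The dependency-chain point in code form (a sketch with AVX2 intrinsics; compile with -mavx2, the function names are mine):

Code:
/* chain(): every add waits on the previous one, so any +1 cycle of
 * VPADDD latency shows up 1:1 in runtime.
 * parallel(): four independent accumulators keep the pipes busy, so
 * per-op latency is hidden and only throughput matters. */
#include <immintrin.h>

__m256i chain(__m256i v, int n) {
    __m256i acc = v;
    for (int i = 0; i < n; i++)
        acc = _mm256_add_epi32(acc, v);   /* serial: latency-bound */
    return acc;
}

__m256i parallel(__m256i v, int n) {
    __m256i a0 = v, a1 = v, a2 = v, a3 = v;
    for (int i = 0; i < n; i += 4) {      /* independent: throughput-bound */
        a0 = _mm256_add_epi32(a0, v);
        a1 = _mm256_add_epi32(a1, v);
        a2 = _mm256_add_epi32(a2, v);
        a3 = _mm256_add_epi32(a3, v);
    }
    return _mm256_add_epi32(_mm256_add_epi32(a0, a1),
                            _mm256_add_epi32(a2, a3));
}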

On some loads the +1 latency seems universal; if the issue were having to pick the wrong reg file out of a split implementation, I'd expect at least some to hit the fast path.

I don't have a clue what is going on.
The +1 cycle latency on loads comes from addressing modes [complex addressing adds 1 cycle] and is orthogonal to the issue at hand. Since loads are by default > 1 cycle latency, they are not affected by the schedulers filling up. In addition, they specifically mention that if your instruction depends on loads or longer-latency instructions, the penalty doesn't apply.
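For reference, the two addressing flavors look like this in a pointer-chasing test (my illustrative sketch; x86-64 GCC/Clang inline asm, reference cycles via __rdtsc):

Code:
/* Simple (base-only) vs. complex (base+index*scale) addressing on a
 * serially dependent load chain; the complex form is the one the guide
 * documents as costing an extra cycle of load-to-use latency. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void) {
    uint64_t cell = 0;
    cell = (uint64_t)&cell;          /* self-pointing cell to chase    */
    uint64_t v = (uint64_t)&cell, idx = 0, t0, t1;
    enum { N = 100000000 };

    t0 = __rdtsc();
    for (long i = 0; i < N; i++)     /* simple: base only (fast case)  */
        __asm__ volatile("movq (%0), %0" : "+r"(v));
    t1 = __rdtsc();
    printf("simple : %.2f cycles/load\n", (double)(t1 - t0) / N);

    t0 = __rdtsc();
    for (long i = 0; i < N; i++)     /* complex: base + index*8        */
        __asm__ volatile("movq (%0,%1,8), %0" : "+r"(v) : "r"(idx));
    t1 = __rdtsc();
    printf("complex: %.2f cycles/load\n", (double)(t1 - t0) / N);
    return 0;
}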
Or it's just that a near-full scheduler has to assign some instructions to the wrong FP register file, needing cross-register-file moves, which take that additional cycle.
All FP schedulers are using the same register file.

They have fallen into that trap before, with RDNA3 and its dual-issue thingy.
I wouldn't compare the two. With RDNA3 the problem is that it has to be explicitly coded for, so a lot of the chip's potential goes unused.
 

naukkis

Senior member
Jun 5, 2002
877
747
136
All FP schedulers are using the same register file.
What's the technical reason for that added latency between operations? The Alpha 21264 behaves just like that: the register file and execution units are split in half, and when dependent instructions have to be sent to both halves because the scheduler is full (the high-IPC case), there is an additional one-cycle latency for register-file syncing between operations. At least, I don't know of other such latency differences in the execution phase.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
What's the technical reason for that added latency between operations? The Alpha 21264 behaves just like that: the register file and execution units are split in half, and when dependent instructions have to be sent to both halves because the scheduler is full (the high-IPC case), there is an additional one-cycle latency for register-file syncing between operations. At least, I don't know of other such latency differences in the execution phase.
We would need an AMD engineer to answer that question. To the best of my knowledge, there is one register file for each domain. We don't know if it is internally divided. What we do know is that the latency penalty applies only to a subset of instructions, and only under specific conditions, while what you describe for the Alpha seems to apply whenever the scheduler is full, regardless of what instructions are being executed.
 

naukkis

Senior member
Jun 5, 2002
877
747
136
We would need an AMD engineer to answer that question. To the best of my knowledge, there is one register file for each domain. We don't know if it is internally divided. What we do know is that the latency penalty applies only to a subset of instructions, and only under specific conditions, while what you describe for the Alpha seems to apply whenever the scheduler is full, regardless of what instructions are being executed.
The Alpha has that additional latency when an instruction has to cross register files. The scheduler (in the Alpha's case, with help from the programmer) tries to keep dependent instructions on the same side of the register file. But in situations where, say, 3 adds need to be scheduled in one clock on a 2+2 ALU configuration, one instruction has to take the wrong side, and as a result its dependent instructions see a one-cycle latency penalty. When the scheduler can isolate dependent instructions to their own sides, full throughput can be maintained without any penalties. A trivial rendering of that forwarding rule is below.
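Code:
/* Toy illustration of clustered-register-file forwarding, in the
 * spirit of the 21264 description above: same-cluster forwarding is
 * 1 cycle, cross-cluster forwarding costs an extra cycle. The numbers
 * are illustrative placeholders, not measurements. */
#include <stdio.h>

int main(void) {
    int producer = 0;  /* cluster that executes the producing add */
    for (int consumer = 0; consumer < 2; consumer++) {
        int latency = 1 + (consumer != producer);  /* +1 if crossing */
        printf("producer on cluster %d -> consumer on cluster %d: "
               "%d cycle(s)\n", producer, consumer, latency);
    }
    return 0;
}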
 