Discussion Zen 5 Architecture & Technical discussion

gdansk · Aug 10, 2024

soresu said:
And yet a rather obvious low hanging fruit for future cores to improve upon.

Maybe. But more ports is actually the expensive part.
The obvious low hanging fruit for future cores is to fix the regression caused by a hazard that seemed unintentional and unrelated to width of the instruction.

MS_AT · Aug 10, 2024

Mahboi said:
Fascinating...
One thing I read somewhere is that Apple's performance success comes also from a fairly fat L2 rather than the more server-typical L1/2/3 AMD uses.
Now granted they also do it in GPUs while Nvidia stays with only L1/L2, maybe it's just a kink AMD will keep. But could it be that with Zen 6, we start seeing the latency bottleneck cured with a smaller or non-existent L3 on client, while a really fat L2 replacing it? Server apps very clearly gain a lot from Zen 5, the problem seems to be more that what we have here is a fully primed server chip that isn't really any kind of improvement in client.

It's less a server cache scheme than something dictated by the overall core architecture. x64 cores target high clocks, so they have more levels of cache to ensure the closest ones are fast. So what we might get is even more levels [Lion Cove with L0] or slight size increases, unless they pivot and go for wider and slower cores.

naukkis said:
Not the point. AMD has only 2 load ports to the FP registers where everybody else has more. Even Intel E-cores will have 3 load ports, so they will probably achieve better IPC in some scalar and 128-bit workloads. AMD has the biggest FP unit of all in its CPUs, and is close to a situation where it will have the worst IPC for desktop and mobile workloads.

P-cores already have 3 load ports for FP at the moment; are they doing considerably better than Zen 4?

Mahboi said:
http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/ Ever so slightly, a percent or two.

Once again, it depends on the workload. Alex has shown that this is the case for y-cruncher SSE kernels, but Geekerwan has shown a net improvement for SSE SPECfp. It's important to note that most basic FP operations [add/subtract/multiply/fma] already had 3-cycle latency, so they are not affected by the latency increase. It's not like everything is worse, and yet people treat it as a general regression...

naukkis said:
Theoretically most powerful - but in real usage cases might actually be worst performer. That's a pretty imbalanced situation.

It really depends on the code. If we look at SPEC as a generalized proxy for "real" FP workloads, then Zen 4 was doing fine vs Intel, and Zen 5 [both Strix and Granite] improved considerably with or without AVX-512.

MS_AT · Aug 10, 2024

https://blog.hjc.im/zen-5-more-details-2.html 2nd part of David Huang's Strix Point analysis

Mahboi · Aug 10, 2024

MS_AT said:
It's less a server cache scheme than something dictated by the overall core architecture. x64 cores target high clocks, so they have more levels of cache to ensure the closest ones are fast. So what we might get is even more levels [Lion Cove with L0] or slight size increases, unless they pivot and go for wider and slower cores.

Seems unlikely, not to say almost impossible, for Zen 6... and even for Zen 7, AMD is probably still going to follow its penny-pinching philosophy.

MS_AT said:
Once again, it depends on the workload. Alex has shown that this is the case for y-cruncher SSE kernels, but Geekerwan has shown a net improvement for SSE SPECfp. It's important to note that most basic FP operations [add/subtract/multiply/fma] already had 3-cycle latency, so they are not affected by the latency increase. It's not like everything is worse, and yet people treat it as a general regression...

Even if it were treated as a total regression, it's 1%. AVX-512 ops get a literal 99% upgrade with minimal added latency. If anything, Zen 5 may surprisingly age like FineWine. We may see a great effort in compilers to massively improve AVX-512 usage in future programs.

It's kind of crazy to think that, if you're right, the Z5 core is actually overbuilt and it's the Z2-era uncore that is mostly or completely holding it back now. Well, except in FP.

Mahboi · Aug 10, 2024

AMD’s Strix Point: Zen 5 Hits Mobile

AMD’s Zen line has gone a long way since it brought AMD’s CPU efforts back from the dead. Successive Zen generations delivered steady improvements that made AMD an increasingly dangerou…

chipsandcheese.com

C&C Zen 5 and Zen5c article.

poke01 · Aug 10, 2024

From David

“
This reminds me of the outrageous rumors that some people spread before, which eventually turned out to be wrong. They were not only slapped in the face, but also angrily claimed that Zen 5 was the worst architecture since Bulldozer. Readers who follow my Twitter may still remember that the PMC data I mentioned in the article actually began to be collected as early as April. At that time, my purpose was to see how much work it took to achieve the performance improvement that some people boasted about. As a result, it was found that, simply by looking at the PMC data, you can tell that for the current x86 microarchitecture it is simply a "dream" to achieve those outrageous rumored goals without sacrificing extreme frequency and extreme performance.

Therefore, my suggestion is to ignore those unreliable rumors, or to correct your expectations. Without a major breakthrough in the semiconductor process, I'm afraid that the CPU performance improvement in the next many years will only be this much. I will mention this sentence once after every chip manufacturer releases its product this year, because it applies whether it is ARM's Cortex-X, the microarchitecture after Apple A13, or the x86 microarchitecture of AMD/Intel.”

yuri69 · Aug 11, 2024

MS_AT said:
https://blog.hjc.im/zen-5-more-details-2.html 2nd part of David Huang's Strix Point analysis

So from the currently available performance counters, it seems the main culprit for Zen 5 IPC is the frontend: L1I misses are horribly up and L2BTB overrides are... wtf. This means either the counters are broken or the frontend lags behind Zen 4.

Mahboi said:
C&C Zen 5 and Zen5c article.

C&C confirms that Zen 5 can't decode >4 instructions when running a single thread. This means the clustered decoder is way less 1T-capable than 2020 Tremont...

Nothingness · Aug 11, 2024

yuri69 said:
So from the currently available performance counters, it seems the main culprit for Zen 5 IPC is the frontend: L1I misses are horribly up and L2BTB overrides are... wtf. This means either the counters are broken or the frontend lags behind Zen 4.

The increase in L1i misses in gcc and perlbench makes no sense. If it really was that bad the score would have been significantly lower, not better.

I can think of two hypotheses (warning, I'm at my first coffee...): they changed the way they count Icache misses, or the second group of decoders (which doesn't seem to work as many expected) is counting misses while doing nothing or being bugged.

dttprofessor · Aug 11, 2024

Skymont is much smaller than zen 5c.

gdansk · Aug 11, 2024

dttprofessor said:
Skymont is much smaller than zen 5c.

How much smaller? Than which Zen 5C? There are at least two Zen 5C.

Abwx · Aug 11, 2024

yuri69 said:
C&C confirms that Zen 5 can't decode >4 instructions when running a single thread. This means the clustered decoder is way less 1T-capable than 2020 Tremont...

If that were the case, Zen 5 wouldn't have 9-13% better ST IPC than RPL, so it is more likely that something is broken in the measuring methodology than in the decoder capabilities.

dttprofessor · Aug 11, 2024

gdansk said:
How much smaller? Than which Zen 5C? There are at least two Zen 5C.

Skymont is about 30% of the area of Lion Cove
Zen 5c is about 75% of the area of Zen 5

gdansk · Aug 11, 2024

dttprofessor said:
Skymont is about 30% of the area of Lion Cove
Zen 5c is about 75% of the area of Zen 5

And does that mean anything without a comparison of the area of Lion Cove to Zen 5?
N3 Zen 5C hasn't been measured yet, as far as I'm aware. I remain curious how you were able to state it so confidently.
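
The objection above can be made concrete with a little arithmetic. The sketch below is only an illustration: it takes the two percentages quoted in the thread (30% and 75%) at face value and treats the Lion Cove to Zen 5 area ratio as an unknown input. The function names and the sample baseline values are hypothetical and are not measurements.

```python
def skymont_vs_zen5c(lion_cove_over_zen5: float,
                     skymont_frac: float = 0.30,
                     zen5c_frac: float = 0.75) -> float:
    """Return Skymont area divided by Zen 5c area.

    lion_cove_over_zen5 is the baseline ratio (Lion Cove core area divided
    by Zen 5 core area). It is NOT given in the thread and must be assumed.
    skymont_frac and zen5c_frac are the fractions quoted in the thread.
    """
    if lion_cove_over_zen5 <= 0:
        raise ValueError("baseline ratio must be positive")
    # Skymont = skymont_frac * LionCove, Zen 5c = zen5c_frac * Zen5,
    # so the ratio depends linearly on the unknown LionCove / Zen5 ratio.
    return skymont_frac * lion_cove_over_zen5 / zen5c_frac


def break_even_baseline(skymont_frac: float = 0.30,
                        zen5c_frac: float = 0.75) -> float:
    """Baseline ratio at which both small cores would have equal area."""
    return zen5c_frac / skymont_frac


if __name__ == "__main__":
    # Hypothetical baseline ratios, chosen only to show the trend.
    for baseline in (1.0, 1.5, 2.0, 2.5):
        print(baseline, round(skymont_vs_zen5c(baseline), 2))
    print("break-even baseline:", round(break_even_baseline(), 2))
```

Under these quoted fractions, Skymont comes out smaller than Zen 5c whenever Lion Cove is less than about 2.5 times the area of Zen 5, but how much smaller cannot be stated without that baseline, and the fractions themselves may differ across process nodes.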

dttprofessor · Aug 11, 2024

gdansk said:
And does that mean anything without a comparison of the area of Lion Cove to Zen 5?
N3 Zen 5C hasn't been measured yet, as far as I'm aware. I remain curious how you were able to state it so confidently.

Zen 5c is about 75% of the area of Zen 5 on STX (TSMC N4)

FlameTail · Aug 11, 2024

gdansk said:
And does that mean anything without a comparison of the area of Lion Cove to Zen 5?
N3 Zen 5C hasn't been measured yet, as far as I'm aware. I remain curious how you were able to state it so confidently.

There was this:

Hulk said:
View attachment 100637

CouncilorIrissa · Aug 11, 2024

Nothingness said:
The increase in L1i misses in gcc and perlbench makes no sense. If it really was that bad the score would have been significantly lower, not better.

I can think of two hypotheses (warning, I'm at my first coffee...): they changed the way they count Icache misses, or the second group of decoders (which doesn't seem to work as many expected) is counting misses while doing nothing or being bugged.

It's also that if it was indeed front-end that was the problem, we would see degradation in branchy workloads like web browsing, no? But that does not seem to be the case.

gdansk · Aug 11, 2024

dttprofessor said:
Zen 5c is about 75% of the area of Zen 5 on STX (TSMC N4)

And N3E Zen 5C vs N3B Skymont?
I find it odd you're confident it's much smaller without measuring. It should be smaller. But how much?

Tuna-Fish · Aug 11, 2024

Abwx said:
If that were the case, Zen 5 wouldn't have 9-13% better ST IPC than RPL, so it is more likely that something is broken in the measuring methodology than in the decoder capabilities.

The µop cache can serve more ops per cycle to single thread.

Nothingness · Aug 11, 2024

CouncilorIrissa said:
It's also that if it was indeed front-end that was the problem, we would see degradation in branchy workloads like web browsing, no? But that does not seem to be the case.

xz, leela, and deepsjeng have difficult-to-predict branches (string comparisons, evaluation functions in game trees), and they show no improvement or even slight regressions.

It also seems we can't rule out an issue in the second group of 4 decoders.

Abwx · Aug 11, 2024

Tuna-Fish said:
The µop cache can serve more ops per cycle to single thread.

The uop cache doesn't generate uops out of a vacuum; it still relies on the flow of instructions decoded by the decoders to build its uop tables. If the decoding bandwidth is not increased relative to Zen 4, then the uop-cache-related instruction flow won't increase and IPC will be stagnant.

AMD’s Strix Point: Zen 5 Hits Mobile

AMD’s Zen line has gone a long way since it brought AMD’s CPU efforts back from the dead. Successive Zen generations delivered steady improvements that made AMD an increasingly dangerou…

chipsandcheese.com

Nothingness · Aug 11, 2024

The uop cache was changed on Zen5. According to David Huang's measurements the op cache brings 23% to Zen5 and 13% to Zen4. This demonstrates Zen5 relies much more on it, and the uarch around it changed significantly. So it's hard to conclude anything since you can't isolate things that impact performance.

Abwx · Aug 11, 2024

By the same token, what Mike Clark stated could be right, that is, that the two decoders can be used by a single thread; e.g., if 4 instructions are decoded by decoder A, then decoder B can decode the next four instructions.

It would be impossible to know which decoder was used for each set of 4 instructions. All you would see is that 4 instructions of each row were decoded, without knowing which decoder did the work: decoder A could have done all of it, or the work could have been shared in alternation.

naukkis · Aug 11, 2024

Abwx said:
The uop cache doesn't generate uops out of a vacuum; it still relies on the flow of instructions decoded by the decoders to build its uop tables. If the decoding bandwidth is not increased relative to Zen 4, then the uop-cache-related instruction flow won't increase and IPC will be stagnant.

You got it wrong. When the uop cache hits, it reuses already decoded instructions and the decoders aren't needed at all. CPUs will actually shut down the decoders completely when loops are 100% cached in the uop cache. Decoders are only needed when the uop cache misses, and when it hits even partially, CPU IPC can grow above the decoding capabilities.
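
A rough way to see why a partial op cache hit rate lets delivered throughput exceed the decode width is a two-path average. The sketch below is a deliberately simplified model, not a description of how Zen 5 actually works: it treats one instruction as one op, ignores switching penalties between the two paths, and the op cache width of 8 is a hypothetical parameter. Only the 4-per-cycle single-thread decode figure comes from the discussion above.

```python
def frontend_ops_per_cycle(hit_rate: float,
                           op_cache_width: float,
                           decode_width: float) -> float:
    """Average ops delivered per cycle under a simple two-path model.

    A fraction hit_rate of ops is served by the op cache at op_cache_width
    ops per cycle; the remainder goes through the decoders at decode_width
    ops per cycle. The average cost per op is the weighted sum of the
    per-op cost of each path, and throughput is its reciprocal.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if op_cache_width <= 0 or decode_width <= 0:
        raise ValueError("widths must be positive")
    cycles_per_op = hit_rate / op_cache_width + (1.0 - hit_rate) / decode_width
    return 1.0 / cycles_per_op


if __name__ == "__main__":
    DECODE_WIDTH = 4    # single-thread decode limit discussed in the thread
    OP_CACHE_WIDTH = 8  # hypothetical value, for illustration only
    for hit_rate in (0.0, 0.5, 0.9, 1.0):
        rate = frontend_ops_per_cycle(hit_rate, OP_CACHE_WIDTH, DECODE_WIDTH)
        print(f"hit rate {hit_rate:.1f}: {rate:.2f} ops/cycle")
```

In this model, any hit rate above zero pushes the average above the decode width, which is the point being made: the decoders bound throughput only for the portion of the instruction stream that misses the op cache.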

Nothingness · Aug 11, 2024

@Abwx Yeah it's hard to know what's going on. There could really be a bug here when using the 2 groups of decoders in single thread. As far as I know no one has demonstrated the decoding of more than 4 instructions per cycle. But this might be due to microbenchmarks being unable to demonstrate that.

Abwx · Aug 11, 2024

naukkis said:
You got it wrong. When the uop cache hits, it reuses already decoded instructions and the decoders aren't needed at all. CPUs will actually shut down the decoders completely when loops are 100% cached in the uop cache. Decoders are only needed when the uop cache misses, and when it hits even partially, CPU IPC can grow above the decoding capabilities.

How many cycles pass before the decoder is put to use?
Instructions are flowing permanently from the instruction cache, so how many instructions can be picked from the uop cache before new instructions are required from the instruction cache?
Even if the uop cache can provide some instructions, you'll still have to decode what follows; it's not like an app's full set of instructions can be provided by the uop cache.
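
One way to explore this question is a toy model of a single hot loop. The sketch below rests on stated assumptions rather than on anything measured: the first iteration is fully decoded and fills the op cache, later iterations hit entirely if the loop body fits, and a body that does not fit gets no reuse at all (a pessimistic simplification). The function name, the capacity value, and the loop sizes are all hypothetical.

```python
def op_cache_fraction(body_ops: int, iterations: int, capacity_ops: int) -> float:
    """Fraction of executed ops served by the op cache for one simple loop.

    Toy assumptions:
      * iteration 1 goes through the decoders and fills the op cache;
      * if the loop body fits in the op cache, every later iteration hits;
      * if it does not fit, nothing is reused (pessimistic simplification).
    """
    if body_ops <= 0 or iterations <= 0 or capacity_ops < 0:
        raise ValueError("sizes must be positive and capacity non-negative")
    if body_ops > capacity_ops:
        return 0.0
    # Only the first of `iterations` passes needs the decoders.
    return (iterations - 1) / iterations


if __name__ == "__main__":
    CAPACITY = 4096  # hypothetical op cache capacity, in ops
    for body, iters in ((100, 1), (100, 10), (100, 1000), (5000, 1000)):
        frac = op_cache_fraction(body, iters, CAPACITY)
        print(f"body={body:5d} iters={iters:5d} -> {frac:.3f} from op cache")
```

Under these assumptions, code that runs once gets nothing from the op cache, while a loop that fits and repeats many times is served almost entirely from it. Real programs sit somewhere in between, which is why the benefit depends heavily on the workload.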
