Discussion Zen 5 Architecture & Technical discussion


MS_AT

Senior member
Jul 15, 2024
207
497
96
Fascinating...
One thing I read somewhere is that Apple's performance success comes also from a fairly fat L2 rather than the more server-typical L1/2/3 AMD uses.
Now granted, they also do it in GPUs while Nvidia stays with only L1/L2, so maybe it's just a kink AMD will keep. But could it be that with Zen 6 we start seeing the latency bottleneck cured with a smaller or non-existent L3 on client, with a really fat L2 replacing it? Server apps very clearly gain a lot from Zen 5; the problem seems to be more that what we have here is a fully primed server chip that isn't really any kind of improvement in client.
It's less a server cache scheme than something dictated by the overall core architecture. x64 targets high clocks, so it has more levels of cache to ensure the closest ones are fast. So what we might get is even more levels [Lion Cove with L0] or slight size increases, unless they pivot and go for wider and slower cores.
Not the point. AMD has only 2 load ports to the FP registers where everybody else has more. Even Intel E-cores will have 3 load ports, so they will probably be able to achieve better IPC for some scalar & 128-bit workloads. AMD has the biggest FP unit of all in its CPUs, and is nearly in a situation where it will have the worst IPC for desktop and mobile workloads.
At the moment P-cores already have 3 load ports for FP; are they doing considerably better than Zen 4?
Once again, it depends on the workload. Alex has shown that this is the case for Y-cruncher SSE kernels, but Geekerwan has shown that for SSE SPECfp there is a net improvement, so it depends. It's important to note that most basic operations [add/subtract/multiply/FMA] already had 3-cycle latency for FP, so they are not affected by the latency increase. It's not like everything is worse, yet people treat it as a general regression...
Theoretically the most powerful, but in real use cases it might actually be the worst performer. That's a pretty imbalanced situation.
It really depends on the code. If we look at SPEC as a generalized proxy for "real" FP workloads, then Zen 4 was doing pretty fine vs Intel, and Zen 5 [both Strix and Granite] improved considerably with or without AVX-512.
 

Mahboi

Golden Member
Apr 4, 2024
1,001
1,803
96
It's less a server cache scheme than something dictated by the overall core architecture. x64 targets high clocks, so it has more levels of cache to ensure the closest ones are fast. So what we might get is even more levels [Lion Cove with L0] or slight size increases, unless they pivot and go for wider and slower cores.
Seems unlikely, not to say almost impossible, for Zen 6... and even for Zen 7, AMD is still probably going to stick to its penny-pinching philosophy.
Once again, it depends on the workload. Alex has shown that this is the case for Y-cruncher SSE kernels, but Geekerwan has shown that for SSE SPECfp there is a net improvement, so it depends. It's important to note that most basic operations [add/subtract/multiply/FMA] already had 3-cycle latency for FP, so they are not affected by the latency increase. It's not like everything is worse, yet people treat it as a general regression...
Even if it were treated as a total regression, it's 1%. AVX-512 ops get a literal 99% upgrade with minimal latency. If anything, Zen 5 may surprisingly turn out to be FineWine. We may see a great effort in compilers to massively improve AVX-512 usage in future programs.

It's kind of crazy to think that, if you're right, the Z5 core is actually overly heavy and it's the Z2-era uncore that is mostly/completely holding it back now. Well, except in FP.
 

poke01

Golden Member
Mar 8, 2022
1,995
2,534
106
From David


This reminds me of the outrageous rumors that some people have spread before, which eventually turned out to be wrong. They were not only slapped in the face, but also angrily claimed that Zen 5 was the worst architecture since Bulldozer. Readers who follow my Twitter may still remember that the PMC data I mentioned in the article actually began to be collected as early as April. At that time, my purpose was to see how much work it took to achieve the performance improvement that some people boasted about. As a result, it was found that as long as you simply look at the PMC data, you can know that for the current x86 microarchitecture, it is simply a "dream" to achieve those outrageous rumored goals without sacrificing extreme frequency and extreme performance.

Therefore, my suggestion is to ignore those unreliable rumors, or to correct your expectations. Without a major breakthrough in the semiconductor process, I'm afraid that the CPU performance improvement in the next many years will only be this much. I will mention this sentence once after every chip manufacturer releases its product this year, because whether it is ARM's Cortex-X, the microarchitecture after Apple A13, or the x86 microarchitecture of AMD/Intel.
 

yuri69

Senior member
Jul 16, 2013
531
948
136
https://blog.hjc.im/zen-5-more-details-2.html 2nd part of David Huang's Strix Point analysis
So from the currently available performance counters it seems the main culprit of Zen 5 IPC is the frontend: L1I misses are horribly up and L2 BTB overrides are... wtf. This means either the counters are broken or the frontend lags behind Zen 4.

C&C Zen 5 and Zen5c article.
C&C confirms that Zen 5 can't decode >4 instructions when running a single thread. This means the clustered decoder is way less 1T-capable than 2020 Tremont...
 

Nothingness

Diamond Member
Jul 3, 2013
3,031
1,971
136
So from the currently available performance counters it seems the main culprit of Zen 5 IPC is the frontend: L1I misses are horribly up and L2 BTB overrides are... wtf. This means either the counters are broken or the frontend lags behind Zen 4.
The increase in L1i misses in gcc and perlbench makes no sense. If it really was that bad the score would have been significantly lower, not better.

I can think of two hypotheses (warning, I'm at my first coffee...): they changed the way they count Icache misses, or the second group of decoders (which doesn't seem to work as many expected) is counting misses while doing nothing or being bugged.
 

Abwx

Lifer
Apr 2, 2011
11,516
4,303
136
C&C confirms that Zen 5 can't decode >4 instructions when running a single thread. This means the clustered decoder is way less 1T-capable than 2020 Tremont...
If that were the case, then Zen 5 wouldn't have 9-13% better ST IPC than RPL, so there's rather something broken in the measuring methodology than in the decoder capabilities.
 

dttprofessor

Member
Jun 16, 2022
53
13
51
And does that mean anything without a comparison of the area of Lion Cove to Zen 5?
N3 Zen 5c hasn't been measured yet, as far as I'm aware. I remain curious how you were able to state it so confidently.
Zen 5c is about 75% of the area of Zen 5 on STX (TSMC N4).
 

CouncilorIrissa

Senior member
Jul 28, 2023
520
1,995
96
The increase in L1i misses in gcc and perlbench makes no sense. If it really was that bad the score would have been significantly lower, not better.

I can think of two hypotheses (warning, I'm at my first coffee...): they changed the way they count Icache misses, or the second group of decoders (which doesn't seem to work as many expected) is counting misses while doing nothing or being bugged.
It's also that if it was indeed front-end that was the problem, we would see degradation in branchy workloads like web browsing, no? But that does not seem to be the case.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,966
136
If that were the case, then Zen 5 wouldn't have 9-13% better ST IPC than RPL, so there's rather something broken in the measuring methodology than in the decoder capabilities.

The µop cache can serve more ops per cycle to single thread.
 
Reactions: yuri69

Nothingness

Diamond Member
Jul 3, 2013
3,031
1,971
136
It's also that if it was indeed front-end that was the problem, we would see degradation in branchy workloads like web browsing, no? But that does not seem to be the case.
xz, leela and deepsjeng have difficult-to-predict branches (string comparisons, evaluation functions in game trees) and they show no improvement or even slight regressions.

It also seems we can't rule out an issue in the second group of 4 decoders.
 

Abwx

Lifer
Apr 2, 2011
11,516
4,303
136
The µop cache can serve more ops per cycle to single thread.
The uop cache doesn't generate uops out of a vacuum; it still relies on the flow of instructions decoded by the decoders to build its uop tables. If the decoding bandwidth is not increased relative to Zen 4, then the uop-cache-related instruction flow won't increase and IPC will be stagnant.



 
Reactions: Vattila and Tlh97

Nothingness

Diamond Member
Jul 3, 2013
3,031
1,971
136
The uop cache was changed on Zen5. According to David Huang's measurements the op cache brings 23% to Zen5 and 13% to Zen4. This demonstrates Zen5 relies much more on it, and the uarch around it changed significantly. So it's hard to conclude anything since you can't isolate things that impact performance.
 

Abwx

Lifer
Apr 2, 2011
11,516
4,303
136
By the same token, what Mike Clark stated could be right, that is, that the two decoders can be used by a single thread; for instance, if 4 instructions are decoded by decoder A, then decoder B can decode the next four instructions.

It would be impossible to know which decoder was used for each set of 4 instructions. All you would see is that 4 instructions of each row were decoded, without knowing which decoder did the work; it could be decoder A that did all the work, or it could have been shared successively.
 

naukkis

Senior member
Jun 5, 2002
877
747
136
The uop cache doesn't generate uops out of a vacuum; it still relies on the flow of instructions decoded by the decoders to build its uop tables. If the decoding bandwidth is not increased relative to Zen 4, then the uop-cache-related instruction flow won't increase and IPC will be stagnant.

You got it wrong. When the uop cache hits, it reuses already decoded instructions and the decoders aren't needed at all. In fact, CPUs will shut down the decoders completely when loops are 100% cached in the uop cache. Decoders are only needed when the uop cache misses, and when it hits even partially, CPU IPC can grow above the decoding capabilities.
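
To put rough numbers on that last point, here is a minimal back-of-the-envelope sketch in Python. It assumes a very simplified front end where some fraction of ops is delivered from the uop cache and the rest from the decoders, with no other stalls or switching penalties. The function name `frontend_ops_per_cycle` and the widths used (4 per cycle from the decoders, 8 per cycle from the uop cache) are illustrative assumptions, not measured Zen 5 figures.

```python
def frontend_ops_per_cycle(hit_fraction, decode_width, opcache_width):
    """Average ops delivered per cycle by a simplified front end.

    hit_fraction: share of ops served from the uop cache (0.0 to 1.0).
    decode_width / opcache_width: ops per cycle from each path (assumed values).
    """
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    # Cycles spent per op on each path, weighted by how often that path is used.
    cycles_per_op = (hit_fraction / opcache_width
                     + (1.0 - hit_fraction) / decode_width)
    return 1.0 / cycles_per_op


if __name__ == "__main__":
    # Hypothetical widths: 4-wide decode, 8-wide uop cache delivery.
    for hit in (0.0, 0.5, 0.9, 1.0):
        rate = frontend_ops_per_cycle(hit, decode_width=4, opcache_width=8)
        print(f"hit fraction {hit:.1f}: {rate:.2f} ops/cycle")
```

In this toy model any non-zero hit fraction pushes the delivered rate above the assumed 4-per-cycle decode limit, which is the effect described above. Real front ends have mode-switch penalties and other limits that this ignores.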
 

Nothingness

Diamond Member
Jul 3, 2013
3,031
1,971
136
@Abwx Yeah it's hard to know what's going on. There could really be a bug here when using the 2 groups of decoders in single thread. As far as I know no one has demonstrated the decoding of more than 4 instructions per cycle. But this might be due to microbenchmarks being unable to demonstrate that.
 

Abwx

Lifer
Apr 2, 2011
11,516
4,303
136
You got it wrong. When the uop cache hits, it reuses already decoded instructions and the decoders aren't needed at all. In fact, CPUs will shut down the decoders completely when loops are 100% cached in the uop cache. Decoders are only needed when the uop cache misses, and when it hits even partially, CPU IPC can grow above the decoding capabilities.
How many cycles before the decoder is put to use?
Instructions are flowing permanently from the instruction cache, so how many instructions can be picked from the uop cache before new instructions are required from the cache?
Even if the uop cache can provide some instructions, you'll still have to decode what follows next; it's not like the full set of an app's instructions can be provided by the uop cache.
 