Zen 5 Architecture & Technical Discussion


SarahKerrigan

Senior member
Oct 12, 2014
735
2,035
136
- 3 64-bit integer multipliers, up from 1; that's strange, as these instructions are rarely a bottleneck (outside of things like GMP or RSA); if this is confirmed, Zen 5 might prove great for computational number theory.

I don't get it either. (I also don't get why ARM went from 2 to 4 imul's in X925.)

Profiling has shown over and over again that the vast majority of multiplies executed are by constants, which shouldn't be lowered to imul anyway. Relatively few of the ones that aren't by constants are perf-critical unless you're doing, like, integer digital signal processing.
 
Reactions: CouncilorIrissa

Nothingness

Diamond Member
Jul 3, 2013
3,031
1,973
136
I don't get it either. (I also don't get why ARM went from 2 to 4 imul's in X925.)

Profiling has shown over and over again that the vast majority of multiplies executed are by constants, which shouldn't be lowered to imul anyway. Relatively few of the ones that aren't by constants are perf-critical unless you're doing, like, integer digital signal processing.
There are some other applications in bitboard-based game engines (though in those engines integer SIMD is often the way to go), and a low-latency imul will always be faster than a chain of cheaper integer operations.

But that’s a niche too.

Even RSA authentication is not a bottleneck for servers, IIRC.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,966
136
- 3 64-bit integer multipliers, up from 1; that's strange, as these instructions are rarely a bottleneck (outside of things like GMP or RSA); if this is confirmed, Zen 5 might prove great for computational number theory.

I don't get it either. (I also don't get why ARM went from 2 to 4 imul's in X925.)

This is absolutely not justified by common loads using that many parallel scalar muls. My best hypothesis is that it's about scheduling: the logic to pick an execution unit takes far more transistors than the execution units themselves, and AMD has in the past added superfluous execution units to simplify scheduling. For example, K7 and K8 had more AGUs than could be served by the memory pipes, because this allowed the scheduling logic to treat the three "pipes" as identical.
 

Doug S

Platinum Member
Feb 8, 2020
2,711
4,602
136
This is absolutely not justified by common loads using that many parallel scalar muls. My best hypothesis is that it's about scheduling: the logic to pick an execution unit takes far more transistors than the execution units themselves, and AMD has in the past added superfluous execution units to simplify scheduling. For example, K7 and K8 had more AGUs than could be served by the memory pipes, because this allowed the scheduling logic to treat the three "pipes" as identical.

Yes, that makes perfect sense to me. ALUs have proliferated over time and become more heterogeneous, but the area cost of making them a little more homogeneous by giving them more capabilities (even when having that many units capable of handling instruction type 'x' makes little sense) is so small now, why not? That's easier than adding more restrictions to the scheduling logic.

Watching the evolution of Apple Silicon's block diagrams from M1 to M4, I keep wondering: why are there more and more differences between the units, when it costs almost nothing to make them (close to) the same? Maybe they're worried about the power cost of having more ungated transistors in active units, or they have some really snazzy scheduling logic that makes it not matter in their profiling, but increased homogeneity seems like a win for a tiny increase in core size.
 

naukkis

Senior member
Jun 5, 2002
878
755
136
This is absolutely not justified by common loads using that many parallel scalar muls. My best hypothesis is that it's about scheduling: the logic to pick an execution unit takes far more transistors than the execution units themselves, and AMD has in the past added superfluous execution units to simplify scheduling. For example, K7 and K8 had more AGUs than could be served by the memory pipes, because this allowed the scheduling logic to treat the three "pipes" as identical.

K7/K8 basically had three-way clustered execution; every cluster needed every capability to be able to operate. Maybe this odd increase in mul/div units hints that the physical register file is split in a similar way, and every part of the register file needs the basic capabilities so that data never has to move between register files.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
I guess this is what Mike Clark was so excited about several years ago. Zen 5 was not meant as one single generation to rule them all, but can be seen as a foundation for the coming years, one that may reap the benefits once process technology allows them to better use that new front end. IMHO it is quite possible they had to scale Zen 5 back a bit because N3 was not financially feasible.
I hope some of the folks who do microbenchmarks (like David or Cheese) put up a new article once the 9000 series is available.

According to David, the bottleneck is somewhere down on the cache/memory side, which could explain some of Zen 5's weaknesses; though fixing it would, of course, only unravel the next bottleneck.


The BP seems to be very, very good.
From his benchmark, I am wondering whether a 2 MiB or even 1.5 MiB L2 for Zen 5 would improve the situation for many apps with a bigger hot-code footprint.
2 MiB of L2 would raise the core area by roughly another mm².

The only exception observed in the chart is the test with one branch per 64 bytes, where the latency spiked after exceeding 16384 branches. A quick calculation shows that 16384 * 64 B = 1 MB, which means the code footprint exceeded the L2 capacity at the point the latency increased.

Another thing: the unified ALU scheduler costs an additional cycle in some cases, but the advantage is that AMD could leverage a smaller PRF across 6 ALUs.
The Int PRF grew by a grand total of 16 entries, which is very surprising, whereas the FP PRF doubled. There is also no change in the number of available ALU scheduler entries (it actually decreased a bit, but in Zen 4 it was shared with the AGUs).

L2 to L3 is still 32 B/cycle, which is another factor. I thought this would finally be 64 B/cycle.



The 32 B/cycle fabric, at an fclk of around 2000 MHz, remains a key factor in latency. Strix probably has a much lower fabric clock, I imagine, and inter-CCX latency is not addressed.

I had hoped that at least the L2 would go to 2 MiB, or that the L2-to-L3 bandwidth would increase; at 64 B/cycle they could make do with 1 MiB of L2 and lean on the massive L3, including V-Cache.
 

LightningZ71

Golden Member
Mar 10, 2017
1,783
2,139
136
My humble observation: combined with what AMD has publicly stated about Zen 5's core, the front end was widened in preparation for future back-end work. They probably ran out of area budget to add more to the back end on the process node they are using. I suspect we'll see the back end get fleshed out more with Zen 6 on N3-family nodes, though it won't be as pronounced as some are expecting.
 

Gideon

Golden Member
Nov 27, 2007
1,765
4,114
136
My humble observation: combined with what AMD has publicly stated about Zen 5's core, the front end was widened in preparation for future back-end work. They probably ran out of area budget to add more to the back end on the process node they are using. I suspect we'll see the back end get fleshed out more with Zen 6 on N3-family nodes, though it won't be as pronounced as some are expecting.
IMHO feeding the beast (i.e. the cache hierarchy, etc.) is a much bigger bottleneck. It's not like they didn't widen the back end (though there is definitely more room, looking at how Zen 1 -> 4 evolved).
 
Reactions: Vattila

LightningZ71

Golden Member
Mar 10, 2017
1,783
2,139
136
I think they will keep the core, out to L2, similar to the current concept for Zen 6. The L3 and whatever their last-level cache looks like will evolve a bit, though.
 
Reactions: Tlh97 and Gideon

FlameTail

Diamond Member
Dec 15, 2021
3,771
2,224
106
I think the L2 cache size will increase; 1 MB is unchanged from Zen 4.

For comparison, the ARM Cortex-X925 can be configured with up to 3 MB of private L2.
Intel's cores also have larger L2s, IIRC.
 

yuri69

Senior member
Jul 16, 2013
531
951
136
Ballooning the L2 doesn't work, as Willow Cove and Zen 4 show.

It was sad to see no update to the fabric config with Zen 4. With Zen 5 it's just painful to wait for Zen 6.
 

jdubs03

Senior member
Oct 1, 2013
700
315
136
Updated SIR 2017 results for STX by David Huang.
Translated:

“Updated Ryzen AI 9 HX 370 (the name is really hard to pronounce) big/small core test results. The big core can touch the level of the M2, and the small core is at the level of the 8cx Gen 3 big core. The cache capacity of these two groups is similar, but ARM has some SLC.

The content tested so far does not seem to be enough to fill an article, so I’ll slowly put it together after the desktop is released... Actually, PMC is quite surprising, and it will take some time to analyze.”
 
Last edited:

DavidC1

Senior member
Dec 29, 2023
782
1,241
96
Translated:

“Updated Ryzen AI 9 HX 370 (the name is really hard to pronounce) big/small core test results. The big core can touch the level of the M2, and the small core is at the level of the 8cx Gen 3 big core. The cache capacity of these two groups is similar, but ARM has some SLC.
I dislike laptop benchmarking, especially when trying to figure out uarch differences.

Why does Huang get such a different result from Geekerwan? Geekerwan's tests showed the AMD E-core noticeably slower than their P-core, while Huang shows the opposite. This isn't the first time.

This is why it needs to be released and the tests done properly on a DESKTOP platform, by Anandte... (sorry, didn't mean to beat a dead horse) HWUnboxed, TechPowerUp, GamersNexus, ComputerBase, etc.
 

MS_AT

Senior member
Jul 15, 2024
210
505
96
I dislike laptop benchmarking, especially when trying to figure out uarch differences.

Why does Huang get such a different result from Geekerwan? Geekerwan's tests showed the AMD E-core noticeably slower than their P-core, while Huang shows the opposite. This isn't the first time.

This is why it needs to be released and the tests done properly on a DESKTOP platform, by Anandte... (sorry, didn't mean to beat a dead horse) HWUnboxed, TechPowerUp, GamersNexus, ComputerBase, etc.
Where do you see the opposite? The Zen 5 mobile core scores 9.72; the Zen 5c mobile core scores 6.45.
 

gdansk

Platinum Member
Feb 8, 2011
2,843
4,239
136
In-depth core analysis from Y-Cruncher author (Alexander J. Yee) http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/
From the developer standpoint, what this means is that there quite literally is no penalty for using AVX512 on Zen5. So every precaution and recommendation against AVX512 that has been built up over the years on Intel should be completely reversed on Zen5 (as well as Zen4). Do not hold back on AVX512. Go ahead and use that 512-bit memcpy() in otherwise scalar code. Welcome to AMD's world.
It is possible AVX512 might matter soon enough.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
In-depth core analysis from Y-Cruncher author (Alexander J. Yee) http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/
Looks like a heavy memory bottleneck, as David found as well.
Similarly, the regression in 1-cycle SIMD instructions was also noticed by David. But it seems this is a new hazard that could be resolved in future uarchs, not some major architectural limitation.
Heavy focus on AVX-512, while no attention was paid to 128-bit and 256-bit instructions.

Likely another 1 MiB of L2 would have helped get some extra perf; it is not an overly thirsty chip. But the fabric really needs an overhaul.
 

CouncilorIrissa

Senior member
Jul 28, 2023
521
2,002
96
Looks like a heavy memory bottleneck, as David found as well.
Similarly, the regression in 1-cycle SIMD instructions was also noticed by David. But it seems this is a new hazard that could be resolved in future uarchs, not some major architectural limitation.
Heavy focus on AVX-512, while no attention was paid to 128-bit and 256-bit instructions.

Likely another 1 MiB of L2 would have helped get some extra perf; it is not an overly thirsty chip. But the fabric really needs an overhaul.
1 MiB of L2 is so not cheap in terms of area, though.
 

MS_AT

Senior member
Jul 15, 2024
210
505
96
It is possible AVX512 might matter soon enough.
Unfortunately, it's unlikely, as most of the market is AVX2 and Intel will push AVX10/256 as much as possible because of their E-cores; so unless AMD quickly manages to grab a lot of market share, I don't expect the current status quo to change in any meaningful way.
Heavy focus on AVX512 while no focus made to 128 bit and 256 bit instructions.
Improving that [doubling execution units] would require a much greater silicon expenditure, because they would need to double the number of register-file ports [it seems they believe 10 is the optimum under current constraints, since they stuck with it in Zen 4 and Zen 5; not sure about Zen 3]. Generally, from the core's point of view, wider vectors are better than more smaller ones, as you need to decode fewer instructions [this matters less on ARM, where instruction length is fixed] to do the same amount of work. But it's harder to code for, and thanks to AVX-512 fragmentation no consumer HW except Zen 4/5 and Intel 11th Gen has it, which makes the TAM rather small.
 