Yeah, I'm familiar with that, but that is not AMD saying they will fix it. That's a rumor from some unknown Asian person on X.
note to self: svgz images won't work.
I mean it looks stylish to see an 8-core Zen 5 beat Intel's best... but let's be honest, nobody should care.
It's 50% more power draw on both the 6- and 8-core parts, for a performance gain of 10% at best.
I didn't say it was, just posted a link with some information. This is a speculation thread.
// Models 00h-0Fh (Breithorn).
// Models 10h-1Fh (Breithorn-Dense).
// Models 20h-2Fh (Strix 1).
// Models 30h-37h (Strix 2).
// Models 38h-3Fh (Strix 3).
// Models 40h-4Fh (Granite Ridge).
// Models 50h-5Fh (Weisshorn).
// Models 60h-6Fh (Krackan1).
// Models 70h-77h (Sarlak).
Yay finally, maybe they will make it to clang 19 with real tunings instead of copy/paste from zen4.
[X86] AMD Zen 5 Initial enablement by ganeshgit · Pull Request #107964 · llvm/llvm-project
This patch enables the basic skeleton enablement of AMD next gen zen5 CPUs.
Some new codenames for Zen 5 CPUs
Overall yes, Zen 5 is fine. What I highlight from this chart is that the 9600X at the same TDP is 29% faster than the 7600X, and the 9700X at the same TDP is 23% faster than the 7700X. In my view that's a total success for a new gen.
I am genuinely veeeeery curious about possible compiler improvements in the upcoming year.Yay finally, maybe they will make it to clang 19 with real tunings instead of copy/paste from zen4.
AMD - The Software Company!
Not at all, I am afraid. According to AMD, a single thread gets a 4-wide decoder, and the design is more latency-bound than throughput-bound anyway. So while I guess the compiler guys can do wonders, I wouldn't keep my hopes high.
How much can that split frontend improve? Everyone seems to agree that it's still misused/mostly unused...how far could compiler upgrades help it, especially in scalar?
Well, servers are also used for compiling. The problem with compilation is that a poor build system can kill CPU scaling, so the actual performance improvement might be a tad higher. I have heard about 15%. You could call it splitting hairs, but well.
But it's notably poor for scalar/INT and relies heavily on SIMD for its performance increase. Games get 5%, compilation 10%. It's strongly carried by interpreters and massive data crunchers.
A server gen, through and through.
Maybe they're going to use the speakers for active noise cancellation of the fan
May I interrupt you for a new interpretation of the miniPC
Beelink SER9 is one of the first mini PCs with Ryzen AI 300 (Strix Point) - Liliputing
I mean... wtf??? Speakers? )) Other than that, nice design, if it's anything like the previous model it's really quiet but.... again... speakers? In that form factor? OMG...
Speaking of that, AMD sounds like they're sweeping something under the carpet.
In layman's terms: it doesn't matter whether you decode 4 or 14 instructions if you often have to wait to "get" even the first instruction. E.g. you are waiting for the branch predictor to decide due to its deep structures, or you were unlucky and the TLB doesn't hold the needed address - you wait.
- C&C found that AMD's stated double decode throughput doesn't actually apply, with SMT being kind of not fully working (yes, vague, I forgot the details)
- David Huang also found something to that extent IIRC
It seems like they're claiming that the dual decoder is running, but kinda sorta ackshually not. Maybe it's just me misunderstanding complex articles, but it feels like we could hear about some microcode update at some point that somehow gains 5% perf in scalar apps.
As for latency-bound... I honestly don't even understand how that works. Decode finishes its job and passes it to the backend, and somehow it's "late"?? Doesn't it all go stage after stage until retire? What's creating latency inside the CPU once decode is done?
The TL;DR I remember is that the 2x 4-wide decoder is only fully utilized with SMT (2 threads per core).
Well, if you want an hours-long explanation I can recommend this video starting from -29 minutes
"A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location.[1]"
Thanks so much for the noob explanation.
every day, the acronyms stray further from God
Also the uop cache has such an insanely high hit rate that it doesn't seem to matter all that much.
Yeah, aarch64's 64k page size apparently brings ~15% gain on AmpereOne in the Phoronix test suite: https://www.phoronix.com/review/ampereone-64k-linux611/5
min page size = 16k
...
Bigger page sizes just seem to win these days: fewer TLB misses, easier L1D cache.
That's interesting. I can't find any details on AmpereOne's L1D other than that it's 64K with ~4-cycle access, so it's hard to extrapolate whether that 15% is just from better TLB usage with 64K pages or whether the L1D/TLB has some bottleneck at 4K page size.
I think the problem is how much you have to wait when it misses. Games reach like 75% uop cache utilization on Zen 4 per C&C's analysis, and if you are unlucky the core will sit there doing nothing for 40 cycles while it fetches instructions to decode from L3 - even worse if from memory. So personally I am curious whether L1i cache growth would help here, and why they haven't bumped this size since Zen 2. I think Intel is using at least a 64 kB I-cache, not to mention the gigantic L1i caches on leading ARM uarchs (192 kB? iirc).
Zen 5's uop cache is significantly different from Zen 4's; I wouldn't draw assumptions/extrapolations. They did some profiling, but not gaming-specific.
Also, for gaming workloads I recommend this analysis from C&C for Zen 4: https://chipsandcheese.com/2023/09/06/hot-chips-2023-characterizing-gaming-workloads-on-zen-4/ - it partially explains why Zen 5 doesn't improve in gaming by a lot.
It's different in what it stores, and it's higher throughput, but the overall mechanism of getting it filled looks the same from comparing the Software Optimization Guides, so if either one misses it will need to wait - unless I have misunderstood something.
AMD’s Ryzen 9950X: Zen 5 on Desktop
AMD’s desktop Zen 5 products, codenamed Granite Ridge, are the latest in the company’s line of high performance consumer offerings. Here, we’ll be looking at AMD’s Ryzen 9 9…