Honestly, the way he spoke, it seemed like I was listening to Joker. Yes, THAT Joker. THE JOKER. It's not his fault they woke him up from his dream too early.
Let him close his eyes, go to sleep, and wake up in another 3 years and try again.
Thanks for pointing that out. They also tested the 9600X and got the same results. I noticed that they used, among others, X-Plane, where Zen 5 does very well, but also F1 22, Far Cry, Final Fantasy, Dota 2, Strange Brigade, Metro Exodus, F1 2020, all games where Zen 5 also does very well.
> We need some developer to go really deep into the instruction set abyss with multiple experimental self-created benchmarking programs to figure out what's going on.

Not to mention that Cinebench's instruction and data flow might be different than the SPEC average. Maybe you should try to find the SPEC subset that correlates best.
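As a rough illustration of hunting for the SPEC subset that best tracks Cinebench, here is a minimal Python sketch. All scores below are invented placeholder numbers and the subtest names are only examples; the point is just the shape of the correlation search, not real data.

```python
# Hypothetical sketch: given per-CPU scores for a few SPEC CPU 2017 subtests
# and a Cinebench score, find which subtest tracks Cinebench best.
# Every number below is made up purely for illustration.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# rows: four hypothetical CPUs; values: per-subtest scores (invented)
spec_subtests = {
    "500.perlbench": [4.1, 4.9, 5.2, 5.3],
    "502.gcc":       [5.0, 5.6, 6.4, 6.5],
    "525.x264":      [6.0, 7.1, 7.9, 8.8],
}
cinebench = [1600, 1900, 2100, 2350]  # invented scores for the same CPUs

best = max(spec_subtests, key=lambda k: pearson(spec_subtests[k], cinebench))
print(best, round(pearson(spec_subtests[best], cinebench), 3))
```

With real review data you would feed in the published per-subtest scores for each CPU instead of these placeholders.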
I don't think that Zen 5 is particularly special in this regard. Phenomena such as

> Zen 5 seems like a totally brand new animal in terms of how it behaves with software.
> Yeah but this is the first confusing generation of Zen.

(to pick just a small random selection of discussion points from this and similar threads) have been observed for how many CPU generations now?
> Zen 2 > Zen 3 (Wow!)

In many power-limited workloads (think Rome to Milan), not exactly "wow", although still good considering the same manufacturing node and almost the same SoC topology.

> Zen 4 -> Zen 5 (WTF???)

"WTF" from the narrow view of consumer-level client computing, maybe.
> …that it is much colder. What am I doing wrong? I expected a better temperature improvement, but that's fine. The only strange thing is the IOD hotspot being a little high, about 43 degrees; with the 7950X3D I was under 40 degrees.

ES CPUs had a different V/F curve, as far as I understand. Also, with PBO the CPU will always boost up to the maximum temperature or voltage/current limits (and PBO is very inefficient, so if you manually OC with static clocks and voltage, the frequency is usually higher than what PBO provides, while temperatures/power dissipation are also lower).
> Yeah but this is the first confusing generation of Zen.
> Zen and Zen+ I don't know much about.
> Zen 2 > Zen 3 (Wow!)
> Zen 3 -> Zen 4 (Impressive!)
> Zen 4 -> Zen 5 (WTF???)

5% better IPC according to Computerbase charts and 350 MHz higher peak frequency, that is, 9% higher Fmax.
> Zen 2 > Zen 3 (Wow!)
> what was remarkable was the quite better perfs in games,

That's obviously more due to the L3 cache changes, less due to microarchitectural changes (larger and fewer L3 cache domains).
> Yeah but this is the first confusing generation of Zen.
> Zen and Zen+ I don't know much about. BUT
> Zen 2 > Zen 3 (Wow!)
> Zen 3 -> Zen 4 (Impressive!)
> Zen 4 -> Zen 5 (WTF???)

* Zen -> Zen+ - proved AMD can iterate and bugfix things
* Zen+ -> Zen 2 - brought the awesome disaggregated technology; massive 64c servers; substantial core performance uplift but still not reaching Intel
* Zen 2 -> Zen 3 - reached peak performance for the whole CCD; L3 helped interactive workloads; server stagnated a bit at 64c
* Zen 3 -> Zen 4 - brought marginal IPC gains with a very strong focus on frequency; temp/TDP reached the top; enabled AVX512; 128c servers dominated
* Zen 4 -> Zen 5 - brought marginal IPC gains with no possible frequency/TDP gains; full AVX512; servers ???
> Zen and Zen+ I don't know much about
> 5% better IPC according to Computerbase charts and 350 MHz higher peak frequency, that is, 9% higher Fmax.

The most important change was the inclusion of Precision Boost 2, which was first introduced for the Raven Ridge APUs. Where the 1800X has a single hard boost step from 4.1 GHz* down to 3.7 GHz for more than two active threads, the 2700X has the gradual opportunistic boost that's still used today. This results in a more than 10% frequency uplift in a wide range of applications.
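To make the contrast concrete, here is a toy Python model of the two boost behaviours described above: the 1800X's single hard step versus a gradual taper standing in for the opportunistic boost. The 4.3 GHz Fmax, the linear shape, and the 16-thread span are illustrative assumptions, not AMD's actual algorithm, which also depends on temperature, current, and power.

```python
# Illustrative model only (not AMD's real boost algorithm), using the
# frequencies quoted in the post above.

def boost_1800x(active_threads):
    # Single hard step: 4.1 GHz at 1-2 active threads, 3.7 GHz beyond.
    return 4.1 if active_threads <= 2 else 3.7

def boost_2700x(active_threads, fmax=4.3, fbase=3.7, max_threads=16):
    # Hypothetical linear taper standing in for Precision Boost 2's
    # gradual opportunistic boost.
    span = max_threads - 1
    return fmax - (fmax - fbase) * (active_threads - 1) / span

for t in (1, 4, 8, 16):
    print(t, boost_1800x(t), round(boost_2700x(t), 2))
```

The step model sits at 3.7 GHz for everything from 3 to 16 threads, while the taper keeps most of the frequency at moderate thread counts, which is where the claimed wide-ranging uplift comes from.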
No wonder it's called cinememe...
> 5% better IPC according to Computerbase charts and 350 MHz higher peak frequency, that is, 9% higher Fmax.
> Good numbers in SPEC, but only 12-13% at Computerbase along with a 200-300 MHz uplift; what was remarkable was the quite better performance in games, no wonder it was acclaimed by the usual gaming crowd that is nowadays vociferating about Zen 5.
> Also 12% IPC at CB, along with a huge Fmax uplift thanks to a new node.
> 16% IPC with a minimally better process and no possibility of higher frequency. Actually the process can't even allow exploiting the better IPC, since its 11% perf/Watt improvement vs N5P is less than the IPC uplift.

Zen 3 to Zen 4 was only about a 7% IPC increase for R23, not even close to 12%.
> * Zen -> Zen+ - proved AMD can iterate and bugfix things
> * Zen+ -> Zen 2 - brought the awesome disaggregated technology; massive 64c servers; substantial core performance uplift but still not reaching Intel
> * Zen 2 -> Zen 3 - reached peak performance for the whole CCD; L3 helped interactive workloads; server stagnated a bit at 64c
> * Zen 3 -> Zen 4 - brought marginal IPC gains with a very strong focus on frequency; temp/TDP reached the top; enabled AVX512; 128c servers dominated
> * Zen 4 -> Zen 5 - brought marginal IPC gains with no possible frequency/TDP gains; full AVX512; servers ???

Not to mention that Zen 3 got upgraded with 3D V-Cache.
> can someone give me a quick rundown of the dual decoders situation

With Zen 5, AMD went from one 4-wide decoder to two 4-wide decoders. They also widened most other parts of the core. On paper it's the biggest change since Zen 1. Unfortunately, the real-world gains do not live up to the hype, and in some workloads are practically non-existent. No one knows why, but there are a few theories; it's probably some combination.
> So why isn't compilation getting any kind of impressive results? Isn't it all data crunching, symbol renaming and large throughput?
> Browser benches give a 33% improvement from 7600X to 9600X, which is huge, but compilation is a paltry ~10-12%. Anyone has an idea why?

Compiling is an integer workload, and that's where it's hard to improve, because it's sensitive to everything. It would be very sensitive to instruction latency and to having fast caches and communication.
While we want apples-to-apples comparisons, in reality that is impossible.
- It could also be that variable-length instructions are a bigger problem than we know, and x86 fundamentally can't reach the same instructions decoded per clock cycle that ARM can.
> To programs, Gracemont's decoder should function just like a traditional 6-wide decoder, fed by a 32 byte/cycle fetch port. It goes both ways, and compared to old-school linear decoders, a clustered out-of-order decoder should lose less throughput around taken branches.

That's why I noticed that Moore's Law and computing enhancements always favor lower power and smaller sizes over outright performance gains. What some might call "democratization" of computing, allowing average people access to some powerful stuff, like in super-poor countries that have no food to eat but do have smartphones with access to wireless internet.
> The AMD Ryzen 7 9700X and Ryzen 5 9600X Review: Zen 5 is Alive
> www.anandtech.com
> So why isn't compilation getting any kind of impressive results? Isn't it all data crunching, symbol renaming and large throughput?
> Browser benches give a 33% improvement from 7600X to 9600X, which is huge, but compilation is a paltry ~10-12%. Anyone has an idea why?

You could probably implement parsers with SIMD (simdjson and SIMD HTML parsers show it's possible), but the compilers we have are optimizing compilers: they do multiple passes, trying to apply different optimization opportunities. Those things are branchy by nature and don't lend themselves easily to SIMD, especially if you have tons of legacy code in there. Anyway, if you have some time, I recommend watching a few online talks about the transformations compilers apply to source code written by humans. I find it amazing what they are capable of doing without using any machine learning. Still, handwritten SIMD code is hard to beat.
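For the curious, here is a tiny pure-Python caricature of the simdjson-style trick mentioned in this thread: classify a whole block of bytes into a bitmask first, then walk the set bits, instead of branching on every character. The 64-byte block size and the quote-finding task are arbitrary choices for illustration; real SIMD code computes the mask with vector compares.

```python
# Toy version of the simdjson idea: block-wide classification into a
# bitmask, then bit tricks to walk matches, instead of a per-character branch.

def quote_mask(block: bytes) -> int:
    # bit i set <=> block[i] is a '"' (branch-free with real SIMD compares)
    mask = 0
    for i, b in enumerate(block):
        mask |= (b == 0x22) << i
    return mask

def quote_positions(data: bytes):
    positions = []
    for base in range(0, len(data), 64):        # 64-byte "vector" blocks
        m = quote_mask(data[base:base + 64])
        while m:
            lsb = m & -m                         # isolate lowest set bit
            positions.append(base + lsb.bit_length() - 1)
            m ^= lsb                             # clear it (like tzcnt/blsr)
    return positions

print(quote_positions(b'{"key": "value"}'))      # [1, 5, 8, 14]
```

An optimizing compiler's pass pipeline has no such regular structure to exploit, which is the contrast the post above is drawing.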
> Chips and Cheese says in their Strix Point review that one thread can't use both decoders. This directly contradicts what AMD's Mike Clark has said, but if this is true then the whole core is bottlenecked by decode.

Profiling data gathered in the Zen 5 review, and in Zen 4 reviews before it, shows that the lion's share of ops is served from the uop cache, so 4-wide decode doesn't have to be a bottleneck. It also might be that the second decoder is used in ST mode, but so rarely that it's not worth mentioning, according to C&C's comments in the 9950X review.
> AMD removed some features in Zen 5 (I believe macro-op fusion was one of them?).

They removed NOP fusion, not all fusions.
> Zen 5 is seemingly bandwidth-starved, which could affect its performance in some workloads. Zen 6 is rumored to be a huge change to the cache structure/Infinity Fabric/uncore.

Only in MT SIMD workloads.
> It could also be that variable-length instructions are a bigger problem than we know, and x86 fundamentally can't reach the same instructions decoded per clock cycle that ARM can.

Yes, x64 is at a disadvantage, and they simply cannot spam decoders like the ARM counterparts can. But the x64 answer seems to be decode clusters and uop caches. It's also why x64 likes beefier SIMD units: you can do more work with a smaller instruction footprint.
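A toy Python model of why that gap exists, under a made-up two-length instruction encoding: with variable lengths, each instruction boundary depends on having decoded the previous instruction, so the scan is inherently serial, while fixed 4-byte encodings make every boundary known up front. The opcode values and lengths here are invented for illustration only.

```python
# Toy model: variable-length (x86-style) boundary finding is a serial
# dependency chain; fixed-width (ARM-style) boundaries are trivially parallel.

def x86_boundaries(code: bytes, length_of) -> list:
    # Serial scan: each step needs the result of the previous one.
    pcs, pc = [], 0
    while pc < len(code):
        pcs.append(pc)
        pc += length_of(code, pc)
    return pcs

def arm_boundaries(code: bytes) -> list:
    # Embarrassingly parallel: boundary i is just 4*i.
    return list(range(0, len(code), 4))

# Hypothetical length decoder: pretend 0x0F starts a 2-byte instruction
# and everything else is 1 byte (real x86 length decoding is far messier).
toy_len = lambda code, pc: 2 if code[pc] == 0x0F else 1

print(x86_boundaries(bytes([0x90, 0x0F, 0x05, 0x90]), toy_len))  # [0, 1, 3]
print(arm_boundaries(bytes(12)))                                  # [0, 4, 8]
```

Decode clusters and uop caches are hardware ways of breaking or bypassing that serial chain: a taken-branch target gives a known start point, and cached uops skip length decoding entirely.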
> I would not say browser benches are fast either. It's only true for the lower-end parts, where architectural choices made to enhance the high end often disproportionately benefit the low end, because they are often artificially segmented. The 9950X shows almost no gains in JetStream: https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/6

Are those benchmarks multithreaded, so as to show gains on the 9950X?
> Zen 5 can barely use the dual decoder clusters, but 4-year-old Tremont can. Gracemont nearly perfects it by eliminating the few cases where branches are needed for parallel execution, and Skymont enhances it further.

The difference is that Skymont has neither SMT nor a uop cache. Zen 4 was able to keep up with Raptor Lake's 6-wide decode while itself being only 4-wide. It might be that Skymont's way shows the future, but so far even Intel is treading carefully, as Lion Cove uses an 8-wide decoder instead of the clustered approach. The reason I mention SMT is that it might be more difficult to implement clustered decode and SMT at the same time, and this is AMD's first implementation; Gracemont was the second or third one. Tremont required artificial branch injection into the instruction stream; Gracemont, I think, adds a fake jmp on its own.
> Anyway, if you have some time, I recommend watching a few online talks about the transformations compilers apply to source code written by humans.

Got a good recommendation for me?
> And a general note: when you say INT, you usually mean scalar workloads.

Yup, you are correct. Scalar workloads are the most important, and they're the hardest to improve, since that's the definition of a general-purpose CPU. It means you have to improve everything without losing performance in what's already out there.
> Tremont required artificial branch injection into the instruction stream; Gracemont, I think, adds a fake jmp on its own.

Tremont didn't need injections if there were enough branches, and it worked consistently enough, unlike Zen 5. C&C testing shows this.
> Got a good recommendation for me?

I would recommend the talks from Matt Godbolt and Chandler Carruth about compiler optimizations. Also the talks about how compilers can use undefined behaviour in C++ to optimize code and how it can backfire, but those might be too specific. A warning, though: I can't tell how approachable they are if you don't already have some background in software.
> Tremont didn't need injections if there were enough branches, and it worked consistently enough, unlike Zen 5. C&C testing shows this.

I think C&C showed that they decayed to 3-wide decode [so using one cluster] in the absence of a sufficient number of branches, so in a sense that's what Zen 5 is doing. But Skymont is supposed to be in a league of its own when it comes to clustered decode. It's probably the most interesting piece of the Arrow Lake puzzle.
The two are different teams, and Intel plays internal politics, so they have very different approaches and ideologies. If Intel survives, the E-core team is the future.