Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Jul 27, 2020
19,595
13,435
146
They also tested the 9600X and got the same results. I noticed that they used, among others, Xplane, where Zen 5 does very well, but also F1 22, Far Cry, Final Fantasy, Dota 2, Strange Brigade, Metro Exodus, and F1 2020, all games where Zen 5 also does very well.
Thanks for pointing that out.
 
Reactions: Abwx
Jul 27, 2020
19,595
13,435
146
Not to mention Cinebench's instruction and data flow might be different from the SPEC average. Maybe you should try to find the SPEC subset that correlates best.
We need some developer to go really deep into the instruction-set abyss with multiple experimental, self-created benchmarking tools to figure out what's going on.

The generally available stuff everyone is using has just been a source of confusion so far. Zen 5 seems like a totally brand new animal in terms of how it behaves with software.
 

inf64

Diamond Member
Mar 11, 2011
3,863
4,535
136
I see a lot of doom and gloom on the web regarding Ryzen 9000 and Windows. The chips are indeed a meh upgrade for gamers, and that point stands. But for productivity, with PBO enabled they do offer nice uplifts in a range of workloads. This doesn't mean AMD is forgiven for its missteps with this launch (there were too many). They need to present true/correct performance numbers in the future and plan better (maybe wait for X3D and launch them in parallel?)
 

StefanR5R

Elite Member
Dec 10, 2016
5,885
8,746
136
Zen 5 seems like a totally brand new animal in terms of how it behaves with software.
I don't think that Zen 5 is particularly special in this regard. Phenomena such as
– INT math, outmoded forms of FP math, and pointer-chasing software not scaling as well on the newest CPUs as we'd wish,
– game FPS being largely independent of CPU performance, causing certain actors to make a big deal out of ‰ differences,
– secret societies suspected of rewriting benchmark software to favor one vendor over another
(to pick just a small random selection of discussion points in this and similar threads) have been observed for how many CPU generations now?
 
Jul 27, 2020
19,595
13,435
146
(to pick just a small random selection of discussion points of this and similar threads) have been observed for how many CPU generations now?
Yeah but this is the first confusing generation of Zen.

Zen and Zen+ I don't know much about. BUT

Zen 2 -> Zen 3 (Wow!)

Zen 3 -> Zen 4 (Impressive!)

Zen 4 -> Zen 5 (WTF???)
 

tsamolotoff

Member
May 19, 2019
170
301
136
that it is much colder. What am I doing wrong?
ES CPUs had a different V/F curve, as far as I understand. Also, with PBO the CPU will always boost up to the maximum temperature or voltage/current limits (and PBO is very inefficient, so if you manually OC with static clocks and voltage, the frequency is usually higher than PBO provides while temperatures / power dissipation are also lower).
 

rydeon95

Junior Member
Aug 13, 2024
8
2
36
ES CPUs had a different V/F curve, as far as I know. Also, with PBO the CPU will always boost up to the maximum temperature or voltage/current limits (and PBO is very inefficient, so if you manually OC with static clocks and voltage, the frequency is usually higher than PBO provides while temperatures / power dissipation are also lower).
I expected a better temperature improvement but that's fine, the only strange thing is IOD HOTSPOT a little high, about 43 degrees, with the 7950x3D I was under 40 degrees
 

Abwx

Lifer
Apr 2, 2011
11,514
4,299
136
Yeah but this is the first confusing generation of Zen.

Zen and Zen+ I don't know much about.
5% better IPC according to Computerbase charts and 350MHz higher
peak frequency, that is, 9% higher Fmax.

Zen 2 -> Zen 3 (Wow!)

Good numbers in SPEC but only 12-13% at Computerbase, along with a
200-300 MHz uplift. What was remarkable was the much better performance in games;
no wonder it was acclaimed by the usual gaming crowd that is nowadays
vociferating about Zen 5.


Zen 3 -> Zen 4 (Impressive!)

Also 12% IPC at CB, along with a huge Fmax uplift thanks to a new node.

Zen 4 -> Zen 5 (WTF???)

16% IPC with a minimally better process and no possibility of higher frequency.
Actually, the process can't even allow the better IPC to be exploited, since its 11% perf/Watt improvement vs N5P is less than the IPC uplift.
 

yuri69

Senior member
Jul 16, 2013
530
944
136
Yeah but this is the first confusing generation of Zen.

Zen and Zen+ I don't know much about. BUT

Zen 2 -> Zen 3 (Wow!)

Zen 3 -> Zen 4 (Impressive!)

Zen 4 -> Zen 5 (WTF???)
* Zen -> Zen+ - proved AMD can iterate and bugfix things
* Zen+ -> Zen 2 - brought the awesome disaggregated technology; massive 64c servers; substantial core performance uplift but still not reaching Intel
* Zen 2 -> Zen 3 - reached peak performance for the whole CCD; L3 helped interactive workloads; server stagnated a bit at 64c
* Zen 3 -> Zen 4 - brought marginal IPC gains with a very strong focus on frequency; temp/TDP reached the top; enabled AVX512; 128c servers dominated
* Zen 4 -> Zen 5 - brought marginal IPC gains with no possible frequency/TDP gains; full AVX512; servers ???
 
Last edited:

Rheingold

Member
Aug 17, 2022
55
150
76
Zen and Zen+ I don't know much about
5% better IPC according to Computerbase charts and 350MHz higher
peak frequency, that is, 9% higher Fmax.
The most important change was the inclusion of Precision Boost 2 which was first introduced for the Raven Ridge APUs. Where the 1800X has a single hard boost step from 4.1 GHz* down to 3.7 GHz for more than two active threads, the 2700X has the gradual opportunistic boost that's still used today. This results in more than 10% frequency uplift in a wide range of applications.



* The diagram shows 4.0 GHz, but there is a feature called "XFR Boost" that can provide another 100 MHz on top.
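The difference between the two boost schemes can be sketched as a toy model. The 1800X step values are from the post; the gradual PB2-style curve below is a made-up illustration of "opportunistic, per-thread" boosting, not AMD's actual algorithm or real 2700X frequencies:

```python
# Toy model of the two boost schemes described above.
# 1800X-style: one hard step -- full boost for <=2 active threads, 3.7 GHz beyond.
# 2700X-style (Precision Boost 2): gradual opportunistic drop per active thread.
# The PB2 curve here is illustrative only, not AMD's real algorithm.

def boost_1800x(active_threads: int) -> float:
    """Single hard boost step: 4.1 GHz (incl. XFR) up to 2 threads, else 3.7 GHz."""
    return 4.1 if active_threads <= 2 else 3.7

def boost_pb2(active_threads: int) -> float:
    """Hypothetical gradual curve: lose ~50 MHz per extra active thread, floor 3.7."""
    return max(3.7, 4.35 - 0.05 * (active_threads - 1))

for n in (1, 2, 4, 8, 16):
    print(n, boost_1800x(n), round(boost_pb2(n), 2))
```

With a mid-range thread count (say 4-8 active threads), the gradual curve stays well above the old hard step, which is where the >10% frequency uplift in real applications comes from.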
 
Last edited:

Josh128

Senior member
Oct 14, 2022
272
391
96
5% better IPC according to Computerbase charts and 350MHz higher
peak frequency, that is, 9% higher Fmax.



Good numbers in SPEC but only 12-13% at Computerbase, along with a
200-300 MHz uplift. What was remarkable was the much better performance in games;
no wonder it was acclaimed by the usual gaming crowd that is nowadays
vociferating about Zen 5.




Also 12% IPC at CB, along with a huge Fmax uplift thanks to a new node.



16% IPC with a minimally better process and no possibility of higher frequency.
Actually, the process can't even allow the better IPC to be exploited, since its 11% perf/Watt improvement vs N5P is less than the IPC uplift.
Zen 3 to Zen 4 was only about 7% IPC increase for R23, not even close to 12%.
 

Abwx

Lifer
Apr 2, 2011
11,514
4,299
136
Zen 3 to Zen 4 was only about 7% IPC increase for R23, not even close to 12%.

I'm talking about the average across 13 programs in MT; it's 12% for CB R15 and 13% for CB R20, which is the same as R23. Most impressive is the 12% with 7-Zip, as INT is way more difficult to improve than FP.

 
Last edited:

biostud

Lifer
Feb 27, 2003
18,603
5,300
136
* Zen -> Zen+ - proved AMD can iterate and bugfix things
* Zen+ -> Zen 2 - brought the awesome disaggregated technology; massive 64c servers; substantial core performance uplift but still not reaching Intel
* Zen 2 -> Zen 3 - reached peak performance for the whole CCD; L3 helped interactive workloads; server stagnated a bit at 64c
* Zen 3 -> Zen 4 - brought marginal IPC gains with a very strong focus on frequency; temp/TDP reached the top; enabled AVX512; 128c servers dominated
* Zen 4 -> Zen 5 - brought marginal IPC gains with no possible frequency/TDP gains; full AVX512; servers ???
Not to mention the zen 3 got upgraded with 3D vcache.
 

Mahboi

Senior member
Apr 4, 2024
957
1,721
96
There is an oddity I can't put my finger on with Zen 5.
Right now we're speculating broadly that it needs high data throughput to get fully used. (at least I am)
INT isn't doing so well while FP/SIMD is stellar.

So why isn't compilation getting any kind of impressive results? Isn't it all data crunching, symbol renaming, and high throughput?
Browser benches give a 33% improvement from the 7600X to the 9600X, which is huge, but compilation is a paltry ~10-12%. Anyone have an idea why?
 
Reactions: carancho

GTracing

Member
Aug 6, 2021
78
191
76
can someone give me a quick rundown of the dual decoders situation
With Zen 5, AMD went from one 4-wide decoder to two 4-wide decoders. They also widened most other parts of the core. On paper it's the biggest change since Zen 1. Unfortunately, the real-world gains do not live up to the hype, and in some workloads are practically non-existent. No one knows why, but there are a few theories. It's probably some combination.
  • Chips and Cheese says in their Strix Point review that one thread can't use both decoders. This directly contradicts what AMD's Mike Clark has said, but if it's true, then the whole core is bottlenecked by decode.
  • AMD removed some features in Zen 5 (I believe macro-op fusion was one of them?).
  • Zen 5 is seemingly bandwidth-starved, which could affect its performance in some workloads. Zen 6 is rumored to be a huge change to the cache structure / infinity fabric / uncore.
  • It could also be that variable-length instructions are a bigger problem than we know, and x86 fundamentally can't reach the same instructions decoded per clock cycle that ARM can.
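The variable-length point in the last bullet can be sketched with a toy decoder. The opcode/length table below is invented for illustration, not real x86 encoding: with fixed 4-byte instructions every decode slot knows where to look up front, while with variable lengths each instruction boundary depends on having decoded the previous instruction.

```python
# Toy illustration of why variable-length (x86-style) decode is harder to
# parallelize than fixed-width (ARM-style) decode. All encodings are made up.

# Fixed width: every instruction is 4 bytes, so the start of instruction i
# is simply 4*i -- N decoders can each grab their slot independently.
def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    return list(range(0, len(code), width))

# Variable length: each instruction's length depends on its leading byte,
# so boundaries must be discovered sequentially -- decoder i cannot start
# until decoder i-1 knows how long its instruction is.
LENGTH_OF = {0x90: 1, 0xB8: 5, 0x0F: 2, 0x48: 3}  # toy opcode -> length table

def variable_length_starts(code: bytes) -> list[int]:
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += LENGTH_OF[code[pc]]  # serial dependency on the previous length
    return starts

code = bytes([0x90, 0xB8, 0, 0, 0, 0, 0x48, 0, 0, 0x0F, 0])
print(variable_length_starts(code))  # boundaries found one at a time
```

This is also why taken branches help a clustered decoder: a branch target is a known-good instruction start, so a second cluster can begin decoding there without waiting for the first cluster to walk up to it.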
 
Last edited:

DavidC1

Senior member
Dec 29, 2023
776
1,230
96
So why isn't compilation getting any kind of impressive results? Isn't it all data crunching, symbol renaming, and high throughput?
Browser benches give a 33% improvement from the 7600X to the 9600X, which is huge, but compilation is a paltry ~10-12%. Anyone have an idea why?
Compiling is an integer workload, and that's where it's hard to improve because it's sensitive to everything. It is very sensitive to instruction latency, and to having fast caches and communication.
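That latency sensitivity can be shown with a toy contrast (illustrative only): the same sum computed over independent elements, which an out-of-order core can overlap freely, versus through a shuffled linked list, where each load's address depends on the previous load's result.

```python
# Toy contrast between a latency-bound dependent chain (pointer chasing,
# typical of compiler data structures) and independent work the core can
# overlap. Both compute a sum over the same values; only the dependency
# structure differs.
import random

N = 1000
vals = list(range(N))

# Independent: each element can be loaded and added in parallel by an OoO core.
independent_sum = sum(vals)

# Dependent: a shuffled linked list -- the address of the next node is only
# known after the current load completes, so load latency adds up serially.
order = list(range(N))
random.shuffle(order)
nxt = {order[i]: order[(i + 1) % N] for i in range(N)}  # one big cycle

node, chained_sum = order[0], 0
for _ in range(N):
    chained_sum += vals[node]
    node = nxt[node]  # the next address depends on this load's result

assert independent_sum == chained_sum  # same answer, very different latency
```

On real hardware the dependent version runs at roughly one cache/memory latency per element, no matter how wide the core is, which is why wider Zen 5 doesn't automatically help such code.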

I would not say browser benches are fast across the board either. It's only true for the lower-end parts, where architectural choices made to enhance the high end often disproportionately benefit the low end, because the low end is often artificially segmented. The 9950X shows almost no gains in JetStream: https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/6

Look how there are cases where the 9950X degrades but it doesn't for 9700X.

Northwood P4 was OK, mostly due to Intel's dominant process technology, but the Northwood Celeron sucked, mostly because of the 1/4 cache (128KB vs 512KB). The Prescott P4 sucked, while the Prescott Celeron improved greatly: it was 20-30% faster than its predecessor, because the effect of the artificial L2 cache limitation was lifted by Prescott's better memory ILP and buffers.
  • It could also be that variable-length instructions are a bigger problem than we know, and x86 fundamentally can't reach the same instructions decoded per clock cycle that ARM can.
While we want Apples to Apples comparisons, in reality that is impossible.

Implementation always trumps paper level differences. In the big picture they have many more people working on ARM parts, with better opportunities, choices, and likely even pay so you have the best people working there.

I'll give you examples. A trace cache was used to overcome this limitation in NetBurst, but the overall chip was trash. The idea then evolved into a much more efficient one in Sandy Bridge, called the micro-op cache (uop cache), and it's still used today in both Intel and AMD parts. Implementation trumps theoreticals.

Zen 5 can barely use its dual decode clusters, but 4-year-old Tremont can use its. Gracemont nearly perfects it by eliminating the few cases where branches are needed for parallel decode, and Skymont enhances it further.

To programs, Gracemont's decoder should function just like a traditional 6-wide decoder, fed by a 32 byte/cycle fetch port. And it goes both ways: compared to old-school linear decoders, a clustered out-of-order decoder should lose less throughput around taken branches.
That's why I've noticed that Moore's Law and computing enhancements always favor lower power and smaller sizes over outright performance gains. It's what some might call the "democratization" of computing, giving ordinary people access to some powerful stuff, like in very poor countries where people have little food to eat but do have smartphones with wireless internet access.
 
Last edited:

MS_AT

Member
Jul 15, 2024
188
414
91

So why isn't compilation getting any kind of impressive results? Isn't it all data crunching, symbol renaming, and high throughput?
Browser benches give a 33% improvement from the 7600X to the 9600X, which is huge, but compilation is a paltry ~10-12%. Anyone have an idea why?
You could probably implement parsers with SIMD [simdjson and SIMD HTML parsers show it's possible], but the compilers we have are optimizing compilers: they make multiple passes, trying to apply different optimization opportunities. Those things are branchy by nature and don't lend themselves easily to SIMD, especially if you have tons of legacy code in there. Anyway, if you have some time, I recommend watching a few online talks about the transformations compilers apply to source code written by humans. I find it amazing what they are capable of doing without any machine learning. Still, handwritten SIMD code is hard to beat.
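A minimal sketch of the SIMD-parsing idea, with a Python integer standing in for a vector mask register (the buffer and function names are made up for illustration; real simdjson uses vector byte-compares plus a movemask instruction): classify a whole block of bytes at once into a bitmask, then walk the mask, instead of branching on every byte.

```python
# Branchy parsing vs a SIMD-style branchless scan, sketched in pure Python.
# simdjson-style parsers classify a block of bytes at once into a bitmask
# (one bit per byte), then process the mask -- no per-byte data branch.

text = b'{"key": "value", "n": [1, 2]}'

# Branchy scalar version: one data-dependent branch per byte.
def quote_positions_branchy(buf: bytes) -> list[int]:
    out = []
    for i, b in enumerate(buf):
        if b == ord('"'):  # unpredictable branch on the data itself
            out.append(i)
    return out

# "SIMD" version: compare all bytes at once, collecting matches as a bitmask.
# A real implementation does the compare + movemask in one vector op per
# 16/32/64-byte block; this loop stands in for that single operation.
def quote_positions_masked(buf: bytes) -> list[int]:
    mask = 0
    for i, b in enumerate(buf):  # one vector compare in hardware
        mask |= (b == ord('"')) << i
    out = []
    while mask:  # iterate over set bits only
        i = (mask & -mask).bit_length() - 1  # index of lowest set bit
        out.append(i)
        mask &= mask - 1  # clear lowest set bit
    return out

assert quote_positions_branchy(text) == quote_positions_masked(text)
```

The masked version branches once per *match* rather than once per *byte*, which is the structural trick that makes SIMD parsers fast; an optimizing compiler's pass pipeline has no equivalent regular structure to exploit.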

And a general note: when you say INT you usually mean scalar workloads, since the SIMD unit is very much capable of running INTeger operations... The thing is that scalar INT ops use a different backend than SIMD INT ops, while scalar FP ops use the same backend as SIMD FP ops and SIMD INT ops. I think it's a similar story to MHz vs MT/s with RAM: the second is the correct term, but people are used to the first, and then it gets confusing, because people seem to think the SIMD unit is incapable of doing INT ops.
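One toy way to see "integer ops in a SIMD register" is SWAR (SIMD Within A Register): pack several narrow integer lanes into one wide word and operate on all lanes with a single integer operation, the way a vector unit adds packed integer lanes in one instruction. This is a sketch assuming no lane overflows; the helper names are invented for illustration.

```python
# SWAR sketch: four 8-bit integer lanes packed into one 32-bit word,
# all lanes added with a single wide integer add (assuming no lane
# overflows, so no carry crosses a lane boundary). A real vector unit
# does this per instruction, e.g. a packed-byte add.

def pack4(a: int, b: int, c: int, d: int) -> int:
    return a | (b << 8) | (c << 16) | (d << 24)

def add4_lanes(x: int, y: int) -> int:
    """Lane-wise 8-bit add of two packed words (no-overflow assumption)."""
    return (x + y) & 0xFFFFFFFF

def lanes(x: int) -> list[int]:
    return [(x >> (8 * i)) & 0xFF for i in range(4)]

x = pack4(1, 2, 3, 4)
y = pack4(10, 20, 30, 40)
print(lanes(add4_lanes(x, y)))  # [11, 22, 33, 44]
```

The point of the sketch: the operation is pure integer arithmetic, yet it is "SIMD" in structure, which is exactly why calling the vector unit "the FP unit" misleads people.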

Chips and Cheese says in their strix point review that one thread can't use both decoders. This directly contradicts what AMD's Mike Clark has said, but if this is true then the whole core is bottlenecked by decode.
Profiling data gathered in Zen 5 reviews, and in Zen 4 reviews before them, shows that the lion's share of ops is served from the uop cache, so 4-wide decode doesn't have to be a bottleneck. It might also be that the second decoder is used in ST mode, but so rarely that it's not worth mentioning, according to C&C's comments in the 9950X review.
AMD removed some features in Zen5 (I believe macro-op fusion was one of them?).
They removed NOP fusion, not all fusion.
Zen5 is seemingly bandwidth starved which could affect it's performance in some workloads. Zen6 is rumored to be a huge change to the cache structure/ infinity fabric/ uncore.
Only in MT SIMD workloads
It could also be that variable-length instructions are a bigger problem than we know, and x86 fundamentally can't reach the same instructions decoded per clock cycle that ARM can.
Yes, x64 is at a disadvantage: it simply cannot spam decoders like its ARM counterparts can. But the x64 answer seems to be decode clusters and uop caches. It's also why x64 likes beefier SIMD units; you can do more work with a smaller instruction footprint.
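The uop-cache half of that answer can be sketched as a toy model (capacity, addresses, and structure are invented for illustration, not any real core's design): once a hot loop's decoded uops are cached by fetch address, decode width stops mattering for that loop.

```python
# Toy sketch of why a uop cache hides narrow decode: decoded micro-ops are
# cached keyed by fetch address, so in hot loops the decoders sit idle.
# Capacity, eviction policy, and addresses are illustrative only.
from collections import OrderedDict

class UopCache:
    def __init__(self, capacity: int = 64):
        self.cache = OrderedDict()  # fetch address -> decoded uops (LRU order)
        self.capacity = capacity
        self.hits = self.decodes = 0

    def fetch(self, addr: int) -> str:
        if addr in self.cache:          # hit: skip the decoders entirely
            self.cache.move_to_end(addr)
            self.hits += 1
            return self.cache[addr]
        self.decodes += 1               # miss: pay the (narrow) decode cost
        uops = f"uops@{addr:#x}"        # stand-in for actual decoding
        self.cache[addr] = uops
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least-recently used
        return uops

uc = UopCache()
loop = [0x100, 0x104, 0x108, 0x10C]  # a small hot loop
for _ in range(1000):
    for addr in loop:
        uc.fetch(addr)
print(uc.hits, uc.decodes)  # decode runs only on the first iteration
```

After warm-up, 3996 of the 4000 fetches hit the cache and only 4 ever touch the decoders, which matches the profiling observation that most ops are served from the uop cache.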
I would not say browser benches are fast across the board either. It's only true for the lower-end parts, where architectural choices made to enhance the high end often disproportionately benefit the low end, because the low end is often artificially segmented. The 9950X shows almost no gains in JetStream: https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/6
Are those benchmarks multithreaded to show gains in 9950X?
Zen 5 can barely use its dual decode clusters, but 4-year-old Tremont can use its. Gracemont nearly perfects it by eliminating the few cases where branches are needed for parallel decode, and Skymont enhances it further.
The difference is that Skymont has neither SMT nor a uop cache. Zen 4 was able to keep up with Raptor Lake's 6-wide decode while itself having only 4-wide. It might be that Skymont's way shows the future, but so far even Intel is treading carefully, as Lion Cove uses 8-wide decode instead of the clustered approach. The reason I mention SMT is that it might be more difficult to implement clustered decode and SMT at the same time, and this is AMD's first implementation; Gracemont was the second or third one. Tremont required artificial branch injection into the instruction stream; Gracemont, I think, adds a fake jmp on its own.
 

Mahboi

Senior member
Apr 4, 2024
957
1,721
96
You could probably implement parsers with SIMD [simdjson and SIMD HTML parsers show it's possible], but the compilers we have are optimizing compilers: they make multiple passes, trying to apply different optimization opportunities. Those things are branchy by nature and don't lend themselves easily to SIMD, especially if you have tons of legacy code in there. Anyway, if you have some time, I recommend watching a few online talks about the transformations compilers apply to source code written by humans. I find it amazing what they are capable of doing without any machine learning. Still, handwritten SIMD code is hard to beat.
Got a good recommendation for me?
 

DavidC1

Senior member
Dec 29, 2023
776
1,230
96
And a general note: when you say INT you usually mean scalar workloads.
Yup, you are correct. Scalar workloads are the most important, and they're the hardest to improve, since that's the definition of a general-purpose CPU: you have to improve everything without losing performance in what's already out there.
Tremont required artifical branch injection in instruction stream, Gracemont I think is adding fake jmp on its own.
Tremont didn't need injections if there were enough branches, and it worked consistently enough, unlike Zen 5. C&C's testing shows this.

The two are different teams, and Intel plays internal politics, so they have very different approaches and ideologies. If Intel survives, the E-core team is the future.
 

MS_AT

Member
Jul 15, 2024
188
414
91
Got a good recommendation for me?
I would recommend the talks from Matt Godbolt and Chandler Carruth about compiler optimizations. Also the talks about how compilers can use undefined behaviour in C++ to optimize code and how that can backfire, though those might be too specific. One warning: I can't tell how approachable they are if you don't already have some background in software.
Tremont didn't need injections if there were enough branches, and it worked consistently enough, unlike Zen 5. C&C's testing shows this.

The two are different teams, and Intel plays internal politics, so they have very different approaches and ideologies. If Intel survives, the E-core team is the future.
I think C&C showed that it decayed to 3-wide decode [so using one cluster] in the absence of a sufficient number of branches, so in a sense that's what Zen 5 is doing. But Skymont is supposed to be in a league of its own when it comes to clustered decode. It's probably the most interesting piece of the Arrow Lake puzzle.
 
Reactions: igor_kavinski