Honestly, the way he spoke, it seemed like I was listening to Joker. Yes, THAT Joker. THE JOKER. It's not his fault they woke him up from his dream too early.
Let him close his eyes, go to sleep, and wake up in another 3 years and try again.
Thanks for pointing that out. They also tested the 9600X and got the same results. I noticed that they used, among others, X-Plane, where Zen 5 does very well, but also F1 22, Far Cry, Final Fantasy, Dota 2, Strange Brigade, Metro Exodus, F1 2020, all games where Zen 5 also does very well.
> We need some developer to go really deep into the instruction set abyss with multiple experimental self-created benchmarking programs to figure out what's going on.

Not to mention that Cinebench's instruction and data flow might be different than the SPEC average. Maybe you should try to find the SPEC subset that correlates best.
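As a rough illustration of hunting for the SPEC subset that best tracks Cinebench, here is a minimal Python sketch. All scores below are invented placeholder numbers and the subtest names are only examples; the point is just the shape of the correlation search, not real data.

```python
# Hypothetical sketch: given per-CPU scores for a few SPEC CPU 2017 subtests
# and a Cinebench score, find which subtest tracks Cinebench best.
# Every number below is made up purely for illustration.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# rows: four hypothetical CPUs; values: per-subtest scores (invented)
spec_subtests = {
    "500.perlbench": [4.1, 4.9, 5.2, 5.3],
    "502.gcc":       [5.0, 5.6, 6.4, 6.5],
    "525.x264":      [6.0, 7.1, 7.9, 8.8],
}
cinebench = [1600, 1900, 2100, 2350]  # invented scores for the same CPUs

best = max(spec_subtests, key=lambda k: pearson(spec_subtests[k], cinebench))
print(best, round(pearson(spec_subtests[best], cinebench), 3))
```

With real review data you would feed in the published per-subtest scores for each CPU instead of these placeholders.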
I don't think that Zen 5 is particularly special in this regard. Phenomena such as

> Zen 5 seems like a totally brand new animal in terms of how it behaves with software.
> Yeah but this is the first confusing generation of Zen.

(to pick just a small random selection of discussion points from this and similar threads) have been observed for how many CPU generations now?
> Zen 2 > Zen 3 (Wow!)

In many power-limited workloads (think Rome to Milan), not exactly "wow", although still good considering the same manufacturing node and almost the same SoC topology.

> Zen 4 -> Zen 5 (WTF???)

"WTF" from the narrow view of consumer-level client computing, maybe.
> …that it is much colder. What am I doing wrong? I expected a better temperature improvement, but that's fine. The only strange thing is the IOD hotspot being a little high, about 43 degrees; with the 7950X3D I was under 40 degrees.

ES CPUs had a different V/F curve, as far as I understand. Also, with PBO the CPU will always boost up to the maximum temperature or voltage/current limits (and PBO is very inefficient, so if you manually OC with static clocks and voltage, the frequency is usually higher than what PBO provides, while temperatures/power dissipation are also lower).
> Yeah but this is the first confusing generation of Zen.
> Zen and Zen+ I don't know much about.
> Zen 2 > Zen 3 (Wow!)
> Zen 3 -> Zen 4 (Impressive!)
> Zen 4 -> Zen 5 (WTF???)

5% better IPC according to Computerbase charts and 350 MHz higher peak frequency, that is, 9% higher Fmax.
> Zen 2 > Zen 3 (Wow!)
> what was remarkable was the quite better perfs in games,

That's obviously more due to the L3 cache changes, less due to microarchitectural changes (larger and fewer L3 cache domains).
> Yeah but this is the first confusing generation of Zen.
> Zen and Zen+ I don't know much about. BUT
> Zen 2 > Zen 3 (Wow!)
> Zen 3 -> Zen 4 (Impressive!)
> Zen 4 -> Zen 5 (WTF???)

* Zen -> Zen+ - proved AMD can iterate and bugfix things
* Zen+ -> Zen 2 - brought the awesome disaggregated technology; massive 64c servers; substantial core performance uplift but still not reaching Intel
* Zen 2 -> Zen 3 - reached peak performance for the whole CCD; L3 helped interactive workloads; server stagnated a bit at 64c
* Zen 3 -> Zen 4 - brought marginal IPC gains with a very strong focus on frequency; temp/TDP reached the top; enabled AVX512; 128c servers dominated
* Zen 4 -> Zen 5 - brought marginal IPC gains with no possible frequency/TDP gains; full AVX512; servers ???
> Zen and Zen+ I don't know much about
> 5% better IPC according to Computerbase charts and 350 MHz higher peak frequency, that is, 9% higher Fmax.

The most important change was the inclusion of Precision Boost 2, which was first introduced for the Raven Ridge APUs. Where the 1800X has a single hard boost step from 4.1 GHz* down to 3.7 GHz for more than two active threads, the 2700X has the gradual opportunistic boost that's still used today. This results in a more than 10% frequency uplift in a wide range of applications.
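To make the contrast concrete, here is a toy Python model of the two boost behaviours described above: the 1800X's single hard step versus a gradual taper standing in for the opportunistic boost. The 4.3 GHz Fmax, the linear shape, and the 16-thread span are illustrative assumptions, not AMD's actual algorithm, which also depends on temperature, current, and power.

```python
# Illustrative model only (not AMD's real boost algorithm), using the
# frequencies quoted in the post above.

def boost_1800x(active_threads):
    # Single hard step: 4.1 GHz at 1-2 active threads, 3.7 GHz beyond.
    return 4.1 if active_threads <= 2 else 3.7

def boost_2700x(active_threads, fmax=4.3, fbase=3.7, max_threads=16):
    # Hypothetical linear taper standing in for Precision Boost 2's
    # gradual opportunistic boost.
    span = max_threads - 1
    return fmax - (fmax - fbase) * (active_threads - 1) / span

for t in (1, 4, 8, 16):
    print(t, boost_1800x(t), round(boost_2700x(t), 2))
```

The step model sits at 3.7 GHz for everything from 3 to 16 threads, while the taper keeps most of the frequency at moderate thread counts, which is where the claimed wide-ranging uplift comes from.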
No wonder it's called cinememe...
> 5% better IPC according to Computerbase charts and 350 MHz higher peak frequency, that is, 9% higher Fmax.
> Good numbers in SPEC, but only 12-13% at Computerbase along with a 200-300 MHz uplift; what was remarkable was the quite better performance in games, no wonder it was acclaimed by the usual gaming crowd that is nowadays vociferating about Zen 5.
> Also 12% IPC at CB, along with a huge Fmax uplift thanks to a new node.
> 16% IPC with a minimally better process and no possibility of higher frequency. Actually the process can't even allow exploiting the better IPC, since its 11% perf/Watt improvement vs N5P is less than the IPC uplift.

Zen 3 to Zen 4 was only about a 7% IPC increase for R23, not even close to 12%.
> * Zen -> Zen+ - proved AMD can iterate and bugfix things
> * Zen+ -> Zen 2 - brought the awesome disaggregated technology; massive 64c servers; substantial core performance uplift but still not reaching Intel
> * Zen 2 -> Zen 3 - reached peak performance for the whole CCD; L3 helped interactive workloads; server stagnated a bit at 64c
> * Zen 3 -> Zen 4 - brought marginal IPC gains with a very strong focus on frequency; temp/TDP reached the top; enabled AVX512; 128c servers dominated
> * Zen 4 -> Zen 5 - brought marginal IPC gains with no possible frequency/TDP gains; full AVX512; servers ???

Not to mention that Zen 3 got upgraded with 3D V-Cache.
> can someone give me a quick rundown of the dual decoders situation

With Zen 5, AMD went from one 4-wide decoder to two 4-wide decoders. They also widened most other parts of the core. On paper it's the biggest change since Zen 1. Unfortunately, the real-world gains do not live up to the hype, and in some workloads are practically non-existent. No one knows why, but there are a few theories; it's probably some combination.
> So why isn't compilation getting any kind of impressive results? Isn't it all data crunching, symbol renaming and large throughput?
> Browser benches give a 33% improvement from 7600X to 9600X, which is huge, but compilation is a paltry ~10-12%. Anyone has an idea why?

Compiling is an integer workload, and that's where it's hard to improve, because it's sensitive to everything. It would be very sensitive to instruction latency and to having fast caches and communication.
While we want apples-to-apples comparisons, in reality that is impossible.
- It could also be that variable-length instructions are a bigger problem than we know, and x86 fundamentally can't reach the same instructions decoded per clock cycle that ARM can.
> To programs, Gracemont's decoder should function just like a traditional 6-wide decoder, fed by a 32 byte/cycle fetch port. It goes both ways, and compared to old-school linear decoders, a clustered out-of-order decoder should lose less throughput around taken branches.

That's why I noticed that Moore's Law and computing enhancements always favor lower power and smaller sizes over outright performance gains. What some might call "democratization" of computing, allowing average people access to some powerful stuff, like in super-poor countries that have no food to eat but do have smartphones with access to wireless internet.
> The AMD Ryzen 7 9700X and Ryzen 5 9600X Review: Zen 5 is Alive
> www.anandtech.com
> So why isn't compilation getting any kind of impressive results? Isn't it all data crunching, symbol renaming and large throughput?
> Browser benches give a 33% improvement from 7600X to 9600X, which is huge, but compilation is a paltry ~10-12%. Anyone has an idea why?

You could probably implement parsers with SIMD (simdjson and SIMD HTML parsers show it's possible), but the compilers we have are optimizing compilers: they do multiple passes, trying to apply different optimization opportunities. Those things are branchy by nature and don't lend themselves easily to SIMD, especially if you have tons of legacy code in there. Anyway, if you have some time, I recommend watching a few online talks about the transformations compilers apply to source code written by humans. I find it amazing what they are capable of doing without using any machine learning. Still, handwritten SIMD code is hard to beat.
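For the curious, here is a tiny pure-Python caricature of the simdjson-style trick mentioned in this thread: classify a whole block of bytes into a bitmask first, then walk the set bits, instead of branching on every character. The 64-byte block size and the quote-finding task are arbitrary choices for illustration; real SIMD code computes the mask with vector compares.

```python
# Toy version of the simdjson idea: block-wide classification into a
# bitmask, then bit tricks to walk matches, instead of a per-character branch.

def quote_mask(block: bytes) -> int:
    # bit i set <=> block[i] is a '"' (branch-free with real SIMD compares)
    mask = 0
    for i, b in enumerate(block):
        mask |= (b == 0x22) << i
    return mask

def quote_positions(data: bytes):
    positions = []
    for base in range(0, len(data), 64):        # 64-byte "vector" blocks
        m = quote_mask(data[base:base + 64])
        while m:
            lsb = m & -m                         # isolate lowest set bit
            positions.append(base + lsb.bit_length() - 1)
            m ^= lsb                             # clear it (like tzcnt/blsr)
    return positions

print(quote_positions(b'{"key": "value"}'))      # [1, 5, 8, 14]
```

An optimizing compiler's pass pipeline has no such regular structure to exploit, which is the contrast the post above is drawing.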
> Chips and Cheese says in their Strix Point review that one thread can't use both decoders. This directly contradicts what AMD's Mike Clark has said, but if this is true then the whole core is bottlenecked by decode.

Profiling data gathered in the Zen 5 review, and in Zen 4 reviews before it, shows that the lion's share of ops is served from the uop cache, so 4-wide decode doesn't have to be a bottleneck. It also might be that the second decoder is used in ST mode, but so rarely that it's not worth mentioning, according to C&C's comments in the 9950X review.
> AMD removed some features in Zen 5 (I believe macro-op fusion was one of them?).

They removed NOP fusion, not all fusions.
> Zen 5 is seemingly bandwidth-starved, which could affect its performance in some workloads. Zen 6 is rumored to be a huge change to the cache structure/Infinity Fabric/uncore.

Only in MT SIMD workloads.
> It could also be that variable-length instructions are a bigger problem than we know, and x86 fundamentally can't reach the same instructions decoded per clock cycle that ARM can.

Yes, x64 is at a disadvantage, and they simply cannot spam decoders like the ARM counterparts can. But the x64 answer seems to be decode clusters and uop caches. It's also why x64 likes beefier SIMD units: you can do more work with a smaller instruction footprint.
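A toy Python model of why that gap exists, under a made-up two-length instruction encoding: with variable lengths, each instruction boundary depends on having decoded the previous instruction, so the scan is inherently serial, while fixed 4-byte encodings make every boundary known up front. The opcode values and lengths here are invented for illustration only.

```python
# Toy model: variable-length (x86-style) boundary finding is a serial
# dependency chain; fixed-width (ARM-style) boundaries are trivially parallel.

def x86_boundaries(code: bytes, length_of) -> list:
    # Serial scan: each step needs the result of the previous one.
    pcs, pc = [], 0
    while pc < len(code):
        pcs.append(pc)
        pc += length_of(code, pc)
    return pcs

def arm_boundaries(code: bytes) -> list:
    # Embarrassingly parallel: boundary i is just 4*i.
    return list(range(0, len(code), 4))

# Hypothetical length decoder: pretend 0x0F starts a 2-byte instruction
# and everything else is 1 byte (real x86 length decoding is far messier).
toy_len = lambda code, pc: 2 if code[pc] == 0x0F else 1

print(x86_boundaries(bytes([0x90, 0x0F, 0x05, 0x90]), toy_len))  # [0, 1, 3]
print(arm_boundaries(bytes(12)))                                  # [0, 4, 8]
```

Decode clusters and uop caches are hardware ways of breaking or bypassing that serial chain: a taken-branch target gives a known start point, and cached uops skip length decoding entirely.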
> I would not say browser benches are fast either. It's only true for the lower-end parts, where architectural choices made to enhance the high end often disproportionately benefit the low end, because they are often artificially segmented. The 9950X shows almost no gains in JetStream: https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/6

Are those benchmarks multithreaded, so as to show gains on the 9950X?
> Zen 5 can barely use the dual decoder clusters, but 4-year-old Tremont can. Gracemont nearly perfects it by eliminating the few cases where branches are needed for parallel execution, and Skymont enhances it further.

The difference is that Skymont has neither SMT nor a uop cache. Zen 4 was able to keep up with Raptor Lake's 6-wide decode while itself being only 4-wide. It might be that Skymont's way shows the future, but so far even Intel is treading carefully, as Lion Cove uses an 8-wide decoder instead of the clustered approach. The reason I mention SMT is that it might be more difficult to implement clustered decode and SMT at the same time, and this is AMD's first implementation; Gracemont was the second or third one. Tremont required artificial branch injection into the instruction stream; Gracemont, I think, adds a fake jmp on its own.
> Anyway, if you have some time, I recommend watching a few online talks about the transformations compilers apply to source code written by humans.

Got a good recommendation for me?
> And a general note: when you say INT, you usually mean scalar workloads.

Yup, you are correct. Scalar workloads are the most important, and they're the hardest to improve, since that's the definition of a general-purpose CPU. It means you have to improve everything without losing performance in what's already out there.
> Tremont required artificial branch injection into the instruction stream; Gracemont, I think, adds a fake jmp on its own.

Tremont didn't need injections if there were enough branches, and it worked consistently enough, unlike Zen 5. C&C testing shows this.
> Got a good recommendation for me?

I would recommend the talks from Matt Godbolt and Chandler Carruth about compiler optimizations. Also the talks about how compilers can use undefined behaviour in C++ to optimize code and how it can backfire, but those might be too specific. A warning, though: I can't tell how approachable they are if you don't already have some background in software.
> Tremont didn't need injections if there were enough branches, and it worked consistently enough, unlike Zen 5. C&C testing shows this.

I think C&C showed that they decayed to 3-wide decode [so using one cluster] in the absence of a sufficient number of branches, so in a sense that's what Zen 5 is doing. But Skymont is supposed to be in a league of its own when it comes to clustered decode. It's probably the most interesting piece of the Arrow Lake puzzle.
The two are different teams, and Intel plays internal politics, so they have very different approaches and ideologies. If Intel survives, the E-core team is the future.