Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

bakyt115 · Jul 29, 2024

Maybe if someone with a 370 in hand can disable SMT and the e-cores and run some tests, then we can compare the results with a 4-core Zen 4.

It would also be interesting to see a core-to-core comparison with SMT on and off.

MangoX · Jul 29, 2024

I'm not sure if it was mentioned, but why did AMD split STX into 2 CCXs? I think that totally killed the core-to-core latency. I mean, it can't be a technical problem, because after all, the 8500G has 2x Zen 4 and 4x Zen 4c sharing the same CCX without the latency problem. Why, AMD? Why?

bakyt115 · Jul 29, 2024

igor_kavinski said:
So how would the 2nd decoder be used? It would decode the instruction immediately after the one being decoded by the first decoder?

As I understand from the Tremont discussion, the second decoder should decode the second path of a branch.

CouncilorIrissa · Jul 29, 2024

bakyt115 said:
Maybe if someone with a 370 in hand can disable SMT and the e-cores and run some tests, then we can compare the results with a 4-core Zen 4.

It would also be interesting to see a core-to-core comparison with SMT on and off.

AMD's laptop firmware hasn't allowed disabling SMT ever since Renoir from what I've heard.

bakyt115 · Jul 29, 2024

MangoX said:
I'm not sure if it was mentioned, but why did AMD split STX into 2 CCXs? I think that totally killed the core-to-core latency. I mean, it can't be a technical problem, because after all, the 8500G has 2x Zen 4 and 4x Zen 4c sharing the same CCX without the latency problem. Why, AMD? Why?

to save area

MS_AT · Jul 29, 2024

Jan Olšan said:
The encoding tests are also all over the place (looking at AV1), with a large gap between Phoenix and Strix, and then the difference gets swapped in the same test at a different resolution? Feels like something doesn't quite work in the scheduling or in clock management. Could also be SVT having an awful threading model, perhaps. AV1 encoding via Handbrake seemed to suck in the ComputerBase review too; perhaps the encoder sucks on big.LITTLE (although when massively threaded, it should effectively stop being big.LITTLE except for the caches). Maybe the encoder is prone to some problem with threading or dispatching SIMD on Zen 5 / Strix for some reason.

The AMD Ryzen AI 9 HX 370 Review: Unleashing Zen 5 and RDNA 3.5 Into Notebooks

www.anandtech.com

That is another thing that would be nice if AT investigated. I mean, it would be nice if they could comment on both the core performance in isolation (pin the workload to a single core, double check it stays there, and measure whatever needs measuring) and MT/SoC performance, where scheduling issues could be confirmed and pointed out. Now we get a visual representation that something was not exactly OK [but was that the test procedure, the CPU itself, or Windows?] but we are left guessing what. Of course I understand those things take time, but they could be highlighted in the review, and it could be mentioned that they will be investigated in another piece [bonus points for keeping the promise and doing the piece].
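For what it's worth, the "pin it and double check it stays there" step can be sketched on Linux roughly like this. This is only a minimal illustration, not how AT or anyone else actually tests: `run_pinned` and `toy_workload` are made-up names, the workload is a placeholder, and `os.sched_setaffinity` is Linux-only. It checks the affinity mask before and after the run; it says nothing about what clock the core actually held.

```python
import os
import time


def toy_workload(n=200_000):
    """Placeholder single-threaded workload (hypothetical, for illustration)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_pinned(workload, cpu):
    """Pin the current process to one logical CPU, run the workload, time it.

    Returns (result, elapsed_seconds, still_pinned), where still_pinned is
    True if the affinity mask was still exactly {cpu} after the run.
    """
    original = os.sched_getaffinity(0)  # remember the mask so it can be restored
    os.sched_setaffinity(0, {cpu})      # restrict this process to one logical CPU
    try:
        if os.sched_getaffinity(0) != {cpu}:
            raise RuntimeError("affinity mask was not applied")
        start = time.perf_counter()
        result = workload()
        elapsed = time.perf_counter() - start
        still_pinned = os.sched_getaffinity(0) == {cpu}
    finally:
        os.sched_setaffinity(0, original)  # always restore the original mask
    return result, elapsed, still_pinned


if __name__ == "__main__":
    cpu = min(os.sched_getaffinity(0))  # any CPU this process is allowed to use
    result, elapsed, still_pinned = run_pinned(toy_workload, cpu)
    print(f"cpu={cpu} elapsed={elapsed:.4f}s still_pinned={still_pinned}")
```

Running the same workload pinned to a Zen 5 core and then to a Zen 5c core would give the "core in isolation" numbers; the unpinned run is where any scheduling problems would show up.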

GTracing · Jul 29, 2024

igor_kavinski said:
So how would the 2nd decoder be used? It would decode the instruction immediately after the one being decoded by the first decoder?

My understanding is that when there's a branch, each decoder will work on a separate branch. Here's the Chips and Cheese article on the matter. https://chipsandcheese.com/2024/07/...t-how-30-year-old-idea-allows-for-new-tricks/

RTX2080 · Jul 29, 2024

bakyt115 said:
Which of the Zen 5 products will be on 3nm?

techjunkie123 said:
Maybe Strix halo too?

AMD is waiting for N3(E) for Strix Halo, and it would be expensive for sure.

Josh128 said:
Certainly PBO or a manual OC though. If it achieved it at 230W, it's so odd that AMD chose to require PBO to reach that perf instead of letting the proc handle it itself. Only thing I can think of is it's a reliability/culpability play, and the voltage and current requirements to hit this ~45K are just not something they are comfortable having to warranty.

Two reasons, I guess: 1, there's no competition currently. 2, AMD is also afraid of silicon degradation, which happened with Raptor Lake.

MS_AT · Jul 29, 2024

GTracing said:
AMD's Mike Clark gave interviews last week to Chips and Cheese and Ian Cutress. He said that practically all core resources can be used by a single thread.

That is why it would be nice if somebody would test decode behaviour with SMT on/off in BIOS for 1T load to put the doubts to rest.

Josh128 · Jul 29, 2024

inf64 said:
Interesting post by David Huang;

https://twitter.com/x/status/1817744992846414238

"The consequence of Zen 5's initial release to most media outlets for testing on ultra-thin notebooks is that you can't even find a few Cinebench tests where a single core ran at full frequency without being throttled..."

No wonder AT couldn't measure any ST IPC increase in SPECint, while David measured around a 10% jump vs. the Zen 4 mobile part.

Another comment (spicy language):

https://twitter.com/x/status/1817781543827620066

edit;
One more

x.com

"I suggest you wait until I finish running SPEC and GB under Linux in a few days before drawing any conclusions.In addition, if you have read my previous analysis of performance bottlenecks, you will know that even for a 6-wide 4ALU x86 processor, the performance bottleneck is mostly not in the decoding width or the number of ALUs."

Problem is, from leaks we've seen in R23 ST, even desktop silicon is also not holding its full single core boost freqs.

Josh128 · Jul 29, 2024

RTX2080 said:
2, AMD is also afraid of silicon degradation, which happened with Raptor Lake.

Almost certainly this. If it could be done with reliability guaranteed, there's no way they wouldn't have done it with Arrow Lake looming.

CouncilorIrissa · Jul 29, 2024

Strix Halo

HP - Geekbench

Benchmark results for a HP with an AMD Eng Sample: 100-000001422-31_N processor.

browser.geekbench.com

HP - Geekbench

Benchmark results for a HP with an AMD Eng Sample: 100-000001422-31_N processor.

browser.geekbench.com

inf64 · Jul 29, 2024

Josh128 said:
Problem is, from leaks we've seen in R23 ST, even desktop silicon is also not holding its full single core boost freqs.

There is still time for bios updates, I hope.

poke01 · Jul 29, 2024

Josh128 said:
Problem is, from leaks we've seen in R23 ST, even desktop silicon is also not holding its full single core boost freqs.

This also happens with Apple's chips; they don't reach max clocks in the R23 ST test, but they do in R2024.

poke01 · Jul 29, 2024

CouncilorIrissa said:
Strix Halo

HP - Geekbench

Benchmark results for a HP with an AMD Eng Sample: 100-000001422-31_N processor.

browser.geekbench.com

HP - Geekbench

Benchmark results for a HP with an AMD Eng Sample: 100-000001422-31_N processor.

browser.geekbench.com

Strange how it’s reporting 3.4GHz as the clock speed.

mostwanted002 · Jul 29, 2024

C'mon, we are a few posts away from the glorious page 690.

CouncilorIrissa · Jul 29, 2024

poke01 said:
Strange how it’s reporting 3.4GHz as the clock speed.

Just don't bother at this point.
Even retail STX has weird clock reporting at times (case in point: same laptop, close scores, vastly different reported clocks)

bakyt115 · Jul 29, 2024

There was a paper about the power consumption of x86 (in comparison with ARM), and it stated that decoding consumes nearly 20% of power for old tiny Atom cores (I may be wrong on the numbers; I can't find the article itself).

Maybe the second decoder is now making it hard for the Zen 5 core to hold max frequency in 1T workloads.

MS_AT · Jul 29, 2024

Josh128 said:
Problem is, from leaks we've seen in R23 ST, even desktop silicon is also not holding its full single core boost freqs.

Have you seen the clocks next to the score? Or do you base it on the fact that score is less than expected? These two things don't need to go hand in hand [although it would be better if boost was not reached for the leaked scores, then there would be a chance something can be tuned in BIOS to boost the clock to advertised values]

Philste · Jul 29, 2024

Rheingold said:
The performance at the same wattage is 17-34% higher. Notebookcheck compared against the Z1 Extreme and 8945HS:

But this also shows that ZEN5 needs more juice compared to ZEN4. The advantage at low wattages is pretty meh for 50% more threads. It only starts getting decent at power levels above ZEN4's sweet spot. I'm 80% sure we will see that every ZEN5 desktop SKU is slower than its predecessor at low wattages. Similar to Igor's leaks, where the ZEN5 ES was beaten by the 7950X pretty much everywhere below 100W. ZEN5 needs juice to run properly.

Josh128 · Jul 29, 2024

MS_AT said:
Have you seen the clocks next to the score? Or do you base it on the fact that score is less than expected? These two things don't need to go hand in hand [although it would be better if boost was not reached for the leaked scores, then there would be a chance something can be tuned in BIOS to boost the clock to advertised values]

It's because the scores are not equaling +17% vs known Zen 4 SKUs at known clocks. We are seeing ~9%-14% for 9600X and 9700X vs their corresponding Zen 4 SKUs depending on whether PBO is on or off.
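To spell out the arithmetic behind that inference, here's a rough sketch. It assumes the score scales linearly with clock times per-clock performance, and that the +17% figure actually applies to this workload; both are assumptions, and the second one is exactly what is being disputed. `implied_clock_ratio` is just an illustrative helper name.

```python
def implied_clock_ratio(score_uplift, per_clock_uplift):
    """Effective clock ratio (new / old) implied by a score uplift.

    Assumes score is proportional to clock * per-clock performance, so
        (1 + score_uplift) = clock_ratio * (1 + per_clock_uplift).
    Both arguments are fractions, e.g. 0.17 for +17%.
    """
    return (1.0 + score_uplift) / (1.0 + per_clock_uplift)


if __name__ == "__main__":
    # Figures quoted in the thread: +17% expected, ~9% to ~14% observed.
    for observed in (0.09, 0.14):
        ratio = implied_clock_ratio(observed, 0.17)
        print(f"observed +{observed:.0%} -> implied clock ratio {ratio:.3f}")
```

A ratio below 1.0 only means "lower effective clocks than the Zen 4 part" if the per-clock assumption holds. If the workload simply gains less than 17% per clock, the same scores are possible with boost fully sustained.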

igor_kavinski · Jul 29, 2024

GTracing said:
My understanding is that when there's a branch, each decoder will work on a separate branch. Here's the Chips and Cheese article on the matter. https://chipsandcheese.com/2024/07/...t-how-30-year-old-idea-allows-for-new-tricks/

So ideally, there should be 4 decoders, to handle the case of two branching pathways and the instruction to be executed after the branch is entered.

LightningZ71 · Jul 29, 2024

Thinking about the Zen5c CCX on Strix Point, that's the least amount of L3 cache per thread since Lucienne. Yes, the L2 is twice as large and still exclusive, so the apparent L2/L3 cache is 33% larger, but Zen5c, even restricted to ~3.6GHz, is much higher performance than Zen2 at up to 4-4.5GHz. This becomes an even bigger issue with AVX-512 code, as that can quickly become memory throughput bound.
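A quick sanity check of those cache numbers, as a sketch. The per-CCX figures below are not stated in this thread; they are the commonly reported configurations (Lucienne: 4 Zen 2 cores, 512KB L2 per core, 4MB L3 per CCX; Strix Point Zen5c CCX: 8 cores, 1MB L2 per core, 8MB L3) and should be treated as assumptions. The `Ccx` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ccx:
    """Cache configuration of one CCX (figures are assumptions, see lead-in)."""
    name: str
    cores: int
    threads_per_core: int
    l2_per_core_mb: float
    l3_shared_mb: float

    @property
    def l3_per_thread_mb(self):
        # Shared L3 divided evenly over all hardware threads in the CCX.
        return self.l3_shared_mb / (self.cores * self.threads_per_core)

    @property
    def l2_plus_l3_per_core_mb(self):
        # "Apparent" capacity per core when L2 and L3 do not duplicate data.
        return self.l2_per_core_mb + self.l3_shared_mb / self.cores


lucienne = Ccx("Lucienne Zen 2 CCX", 4, 2, 0.5, 4.0)
strix_zen5c = Ccx("Strix Point Zen5c CCX", 8, 2, 1.0, 8.0)

if __name__ == "__main__":
    for ccx in (lucienne, strix_zen5c):
        print(f"{ccx.name}: {ccx.l3_per_thread_mb} MB L3/thread, "
              f"{ccx.l2_plus_l3_per_core_mb} MB L2+L3/core")
    gain = strix_zen5c.l2_plus_l3_per_core_mb / lucienne.l2_plus_l3_per_core_mb - 1
    print(f"L2+L3 per core gain: {gain:.0%}")
```

With these inputs the L3 per thread comes out the same for both (0.5MB), and the combined L2+L3 per core goes from 1.5MB to 2MB, which is where the 33% figure comes from.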

All of that is to say that the Zen5c CCX is very memory bandwidth starved and will be hitting the memory controller and snooping the other CCX a lot. This seems like a very hamstrung design unless the memory controller has gobs of bandwidth available at low latency. A MALL cache would have made this make sense. The current design just exacerbates existing problems with limited bandwidth.

This leads me to believe that AMD changed course hard after they had frozen critical aspects of Strix Point's design, like the rumored NPU expansion that removed the MALL. They wanted to hit a marketing goal of having a processor that looked like Alder Lake/Raptor Lake, but things went sideways along the way and they didn't course correct well.

Ghostsonplanets · Jul 29, 2024

techjunkie123 said:
Maybe Strix halo too?

IOD is said to be N3E, yes.
Compute Die should be N4P still.

SarahKerrigan · Jul 29, 2024

igor_kavinski said:
So ideally, there should be 4 decoders, to handle the case of two branching pathways and the instruction to be executed after the branch is entered.

What matters is basic blocks. The two-ahead predictor allows for the frontend to make forward progress on the BB from the next taken branch, then the BB from the taken branch after that.

I'm not sure how you got "it needs four decoders" from that.
