Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

Markfw · Oct 18, 2024

Rheingold said:
What do you mean by "never"? Software that's well suited for SMT yields much more than that on Intel systems. I couldn't find new benchmarks, but an 8700K had several uplifts of 30% and more here. The SMT uplift on Ryzen/Epyc is higher, but Intel's was not just 10%. Or is this just about Xeons, and you can provide some data for this?

In the article I wrote it was an average. Can't find it now. average Intel was 10-15% AMD was 25-30%

Saylick · Oct 18, 2024

naukkis said:
That was my point - 2-way SMT uplift is exactly 100% when workload is 100% memory access latency bound. This is purely mathematical fact from 2-independent threads scaling when they do not hamper other thread execution at all. Compute intensive fp-workloads scale pretty well with SMT because instruction latencies are many clock cycles but nowhere near 100% scaling. And with those kind of fp-workloads you see similar SMT gains also on Intel hardware.

Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the computer and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units instead of the FP execution.

MS_AT · Oct 18, 2024

Nothingness said:
I am wondering. I would have expected parallel compilation to benefit a lot from SMT.

Scaling issues. It is why servethehome.com started splitting compilation workloads into lets say 4x 64 workers to better show througput achieved by newers chips. Without doing further measurements its hard to say what exactly went wrong. This single result only tell us that compilation with 128 workers is doing better than with 256 workers.

JustViewing · Oct 18, 2024

Probably one of the reasons for AMD's better SMT performance compared to Intel may be due to split Int, FPU schedulers and execution units. When one thread is executing Integer code another thread is free to execute FPU code and vice versa. Pure speculation on my part, I have no data to back it up.

naukkis · Oct 18, 2024

Saylick said:
Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the computer and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units instead of the FP execution.

I should have said that longer instruction latencies makes it harder to extract full hardware performance. Of course it's still possible to make well optimized fpu-algorithm that bottlenecks some part of hardware so well that SMT could only decrease performance. But because fp instruction latencies are generally longer than integer SMT scaling is usually better on fp than pure integer workloads.

DrMrLordX · Oct 18, 2024

yuri69 said:
Zen 6 10+% IPC comes from the same slide stating Zen 5 10-15+%. Translated from marketing speak: Zen 6 10-14%, Zen 5 10-19%.

The current shape of Zen 5 matches this range. So the 10-14% uptick of Zen 6 comes from the current shape of Zen 5.

That doesn't address circumstances where disabling SMT on Zen5 produces performance gains (in gaming and elsewhere). You're also citing pretty wide performance ranges, so there's a big difference between Zen6 being +10% over Zen5's +10% versus Zen6 being +14% over Zen5's +19%.

Rheingold · Oct 18, 2024

Markfw said:
average Intel was 10-15% AMD was 25-30%

Here's another data point using the SPEC suite: +26.3% for Zen 4, +18.82% for Golden Cove. That's more in line of what I'd expect. SMT works better on AMD, but the uplift on average isn't twice as high or even more, but more like 40% better. Of course this can always be skewed in one direction or the other by carefully choosing the tested applications.

The main takeaway for me is that Intel removing HT still doesn't make sense, because the performance uplift should still be higher than the die area required. Perhaps it will make sense if the smaller E-cores completely replace the P-cores in the future as their performance deficit shrinks with every generation, or if those Rentable Units that were speculated about still materialize at some point.

OneEng2 · Oct 18, 2024

MS_AT said:
Nah, I don't think we have sufficient data to say Zen5 is universally faster in ST workloads or that it does universally better at MT.

Where? And was it made clear this will be consumer facing die? I mean we already have 16 core CCDs in Turin D.

Sometimes design changes are done to enable clock increase

It's doing better than Ampere that is using custom ARM cores.

Actually, it is currently only speculation that Arrow Lake will be lower in ST workloads than either 14900K or 9950X; however, unless Intel has leaked all the information to date and it is all wrong, I don't expect the reviews to show favorable ST performance for Arrow Lake. We will soon see.

I believe that the Zen 6 CCD speculation is that it will come in 8, 16, and 32 core flavors. My guess is that the 8 and 16 core variants will be Zen 6, and that the 32 core CCD will be Zen 6c. This would then provide a very diverse set of processors. You could have 8p and 32c for 40 cores and 80 threads with a "high end" processor having 16p and 32c OR 16p and 16p.

Markfw said:
In the article I wrote it was an average. Can't find it now. average Intel was 10-15% AMD was 25-30%

I either read your article, or read several others, but that is my recollection on the average as well.

Rheingold said:
Here's another data point using the SPEC suite: +26.3% for Zen 4, +18.82% for Golden Cove. That's more in line of what I'd expect. SMT works better on AMD, but the uplift on average isn't twice as high or even more, but more like 40% better. Of course this can always be skewed in one direction or the other by carefully choosing the tested applications.

The main takeaway for me is that Intel removing HT still doesn't make sense, because the performance uplift should still be higher than the die area required. Perhaps it will make sense if the smaller E-cores completely replace the P-cores in the future as their performance deficit shrinks with every generation, or if those Rentable Units that were speculated about still materialize at some point.

.... and that is my main takeaway as well. SMT provides a very good return on die space investment for MT loads. It also does so at a minimal price in thermal output per performance increase.

I believe that P cores will be around as will E cores.... but I am not so sure it makes as much sense to do it the way Intel has done having 2 totally different architectures. It seems quite certain that having 2 completely different architectures is likely much less cost and time efficient that taking a processor like a full Zen 5, modifying its transistor types for less clock speed and more power efficiency, and removing some of the cache. Not that I want to belittle the work needed to do that, but it is likely much much less effort than a completely different CPU design.

SMT really is a very cool idea. It simply allows the multiple parallel paths within a single core to be better utilized more often for a minimum additional logic added. Really don't understand where Intel's head is on this one.

DavidC1 · Oct 19, 2024

Saylick said:
Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the computer and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units instead of the FP execution.

Matrix Math is one workload that doesn't benefit much from SMT, or if at all because it can saturate the FP units.

Cinebench is not a compute intensive FP workload. It is rather balanced with a mix of both scalar integer and FP instructions. So there's good headroom for SMT to scale, plus it's very parallel so it takes nice advantage of extra threads.

naukkis said:
But because fp instruction latencies are generally longer than integer SMT scaling is usually better on fp than pure integer workloads.

What workloads make FP benefit more from HT than with Integer? FP is the one that's easier to parallelize(has roots back in an accelerator in the pre-486 days).

You gain very little using SMT for HPC workloads because of the same reason.

DavidC1 · Oct 19, 2024

JustViewing said:
Probably one of the reasons for AMD's better SMT performance compared to Intel may be due to split Int, FPU schedulers and execution units. When one thread is executing Integer code another thread is free to execute FPU code and vice versa. Pure speculation on my part, I have no data to back it up.

This is plausible. The way they handle shared/split resources SMT on vs off likely also have to do with it too.

This is for Nehalem. I don't know for modern CPUs. Maybe we should compare them?

For original Ryzen

Saylick · Oct 19, 2024

DavidC1 said:
Matrix Math is one workload that doesn't benefit much from SMT, or if at all because it can saturate the FP units.

Considering that my workload only uses SSE2 and runs on Skylake-based servers (Xeon Gold 6146, specifically), which has dual FMA units but only two loads and one store unit, I am not sure it's the FP units which are saturated. Idk, just seems like the FP execution engine is entirely capable of handling two threads. Fwiw, when I saturate both HT threads with this workload, the performance per thread gets cut in half so there's no net gain in throughput.

naukkis · Oct 19, 2024

DavidC1 said:
What workloads make FP benefit more from HT than with Integer? FP is the one that's easier to parallelize(has roots back in an accelerator in the pre-486 days).

You gain very little using SMT for HPC workloads because of the same reason.

I don't know how data being fixed or floating point based would change anything with parallellization. But fp instruction latencies are generally longer than equivalent integer which potentially makes bubbles on instruction stream which SMT could fill. HPC style code with unlimited parallelism able to exploit full hardware flops are special cases not general.

JustViewing · Oct 19, 2024

naukkis said:
But because fp instruction latencies are generally longer than integer SMT scaling is usually better on fp than pure integer workloads.

FP Latencies are not that long, most operations can be performed in a single cycle. Even mul operation can be performed in 3 cycles. If the FP what we are talking about is vector FP, then most likely the execution units will be fully saturated. Most of the time, vector execution is performed on relatively large dataset which has enough operations to perform on available data and able to fully saturate the FPU execution units. However, if FPU is just used for small calculations/ scalar then it can benefit from SMT.

Even when executing heavy vector code, the integer side of the CPU might be able to squeeze in another SMT thread for additional 5-10% performance.

naukkis · Oct 19, 2024

JustViewing said:
FP Latencies are not that long, most operations can be performed in a single cycle. Even mul operation can be performed in 3 cycles. If the FP what we are talking about is vector FP, then most likely the execution units will be fully saturated. Most of the time, vector execution is performed on relatively large dataset which has enough operations to perform on available data and able to fully saturate the FPU execution units. However, if FPU is just used for small calculations/ scalar then it can benefit from SMT.

Fp-throughput for most operations can be 1 per cycle but fp latencies surely aren´t mostly single cycle. Simplest operation add is sure single cycle latency for integer but for xmm it's 5 cycle latency for Zen1, 3 cycle for Haswell and most newer designs. Mul and div have even longer latencies. Instruction latencies matter when your code instructions are dependent to each other and next instruction can only issued after first ones results comes. With fp even doing dependent adds left two bubbles in execution pipeline which are there available for further code optimization or SMT to issue same instructions from different thread.

MS_AT · Oct 19, 2024

naukkis said:
Fp-throughput for most operations can be 1 per cycle but fp latencies surely aren´t mostly single cycle. Simplest operation add is sure single cycle latency for integer but for xmm it's 5 cycle latency for Zen1, 3 cycle for Haswell and most newer designs. Mul and div have even longer latencies. Instruction latencies matter when your code instructions are dependent to each other and next instruction can only issued after first ones results comes. With fp even doing dependent adds left two bubbles in execution pipeline which are there available for further code optimization or SMT to issue same instructions from different thread.

Latency is 2-3 for FP add, 3 for FP mul and 4 for FMA. 2 adds and 2 fma/mul per cycle. Zen5 is capable of sustaining that from single thread. With AVX512 there should be enough of architectural registers that compiler will be able to hide the latencies.

naukkis · Oct 19, 2024

MS_AT said:
Latency is 2-3 for FP add, 3 for FP mul and 4 for FMA. 2 adds and 2 fma/mul per cycle. Zen5 is capable of sustaining that from single thread. With AVX512 there should be enough of architectural registers that compiler will be able to hide the latencies.

My point was that fp-instructions have longer latencies than integer. That's pretty obvious because fixed point calculation is one operation and fp isn't, base and exponent calculations has to be serialized because they are related and cannot be completely executed simultaneously. That makes it really hard to do even simplest operations like add with single cycle latency.

csbin · Oct 19, 2024

9950X@220W vs 285K@250W

https://twitter.com/x/status/1847662406593630691

Markfw · Oct 19, 2024

So 9950x is faster with 30 watts less power ?

igor_kavinski · Oct 19, 2024

Markfw said:
In the article I wrote it was an average. Can't find it now. average Intel was 10-15% AMD was 25-30%

https://twitter.com/x/status/1702629569834938549

Yeah, Intel looks bad there.

Saylick · Oct 19, 2024

csbin said:
9950X@220W vs 285K@250W

https://twitter.com/x/status/1847662406593630691

Allegedly, that 9950X is an outlier.

Philste · Oct 19, 2024

igor_kavinski said:
Yeah, Intel looks bad there.

Not really, just a different appproach.

AMD is 26% more performance at 27% more power.

Intel is 18% more perfomance at 3% more power.

Overall it's just different, I would even argue that Intels approach is better, because it's basically free performance at same power. Which makes it even more unbelievable that they removed it for new consumer CPUs.

igor_kavinski · Oct 19, 2024

Philste said:
AMD is 26% more performance at 27% more power.

Hmm...could be caused by a lot of mispredictions.

Abwx · Oct 19, 2024

Philste said:
Not really, just a different appproach.

AMD is 26% more performance at 27% more power.

Intel is 18% more perfomance at 3% more power.

Overall it's just different, I would even argue that Intels approach is better, because it's basically free performance at same power. Which makes it even more unbelievable that they removed it for new consumer CPUs.

There must be a mistake in those numbers, because 18% better throughput imply 18% more power drained by the exe units, registers, LSU as well as adders and multipliers and so on.

DavidC1 · Oct 19, 2024

Saylick said:
Considering that my workload only uses SSE2 and runs on Skylake-based servers (Xeon Gold 6146, specifically), which has dual FMA units but only two loads and one store unit, I am not sure it's the FP units which are saturated.

Skylake has very fat Load/Store units. It's two Load units with 32B/cycle for client and two 64B/cycle for server, which is double the ones in Haswell. That fits 2x512 units for AVX512 perfectly.

It goes back to @naukkis mentioning about Cinebench being a heavy FP workload, when it's not.

The same workload where it scales well on SMT is also same workload where it doesn't gain a lot from AVX512.

OneEng2 · Oct 19, 2024

Markfw said:
So 9950x is faster with 30 watts less power ?

That is the way I read it.... and Arrow Lake is on a newer generation process node as well. That is the most distressing part for me.

Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Golden Member

Moderator Emeritus, Elite Member

Diamond Member

Senior member

Senior member

Golden Member

Lifer

Member

Senior member

Golden Member

Golden Member

Diamond Member

Golden Member

Senior member

Golden Member

Senior member

Golden Member

Senior member

Moderator Emeritus, Elite Member

Lifer

Diamond Member

Senior member

Lifer

Lifer

Golden Member

Senior member