Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,389
15,513
136
What do you mean by "never"? Software that's well suited for SMT yields much more than that on Intel systems. I couldn't find new benchmarks, but an 8700K had several uplifts of 30% and more here. The SMT uplift on Ryzen/Epyc is higher, but Intel's was not just 10%. Or is this just about Xeons, and you can provide some data for this?
In the article I wrote, it was an average. Can't find it now. On average, Intel was 10-15% and AMD was 25-30%.
 
Reactions: Tlh97 and OneEng2

Saylick

Diamond Member
Sep 10, 2012
3,645
8,223
136
That was my point - 2-way SMT uplift is exactly 100% when the workload is 100% memory-access-latency bound. This is a purely mathematical fact of two independent threads scaling when they do not hamper each other's execution at all. Compute-intensive FP workloads scale pretty well with SMT because instruction latencies are many clock cycles, but nowhere near 100% scaling. And with those kinds of FP workloads you see similar SMT gains on Intel hardware as well.
Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the machine, and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units rather than the FP execution.
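
For illustration of the quoted best case, here is a minimal sketch (my own hypothetical microbenchmark, with Linux-specific pinning; the SMT sibling numbering is an assumption): two threads chasing independent, randomly permuted linked lists are stalled on DRAM latency almost all the time, so running both on the SMT siblings of one physical core should come close to 2x the aggregate throughput of a single thread.

// smt_memlat.c -- two memory-latency-bound threads on one physical core.
// Build: gcc -O2 -pthread smt_memlat.c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES (1 << 24)   // ~16M nodes (~128 MB), far larger than L3
#define STEPS (1 << 26)

typedef struct node { struct node *next; } node_t;

// Build a randomly permuted cycle so the hardware prefetchers can't help.
static node_t *make_chain(void) {
    node_t *n = malloc(sizeof(node_t) * NODES);
    size_t *perm = malloc(sizeof(size_t) * NODES);
    for (size_t i = 0; i < NODES; i++) perm[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {       // Fisher-Yates shuffle
        size_t j = (size_t)rand() % (i + 1);       // glibc-sized RAND_MAX assumed
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)
        n[perm[i]].next = &n[perm[(i + 1) % NODES]];
    free(perm);
    return n;
}

static void *chase(void *arg) {
    node_t *p = arg;
    for (long i = 0; i < STEPS; i++) p = p->next;  // ~one cache miss per step
    return p;                                      // defeat dead-code elimination
}

int main(void) {
    // Pin the two threads to SMT siblings of one core, e.g. cpu 0 and cpu 8
    // on a typical 8-core part (the sibling numbering is an assumption --
    // check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list).
    int cpus[2] = { 0, 8 };
    pthread_t t[2];
    for (int i = 0; i < 2; i++) {
        pthread_attr_t a;
        cpu_set_t set;
        pthread_attr_init(&a);
        CPU_ZERO(&set);
        CPU_SET(cpus[i], &set);
        pthread_attr_setaffinity_np(&a, sizeof(set), &set);
        pthread_create(&t[i], &a, chase, make_chain());
    }
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    puts("done");
    return 0;
}

Timing a one-thread run against the two-thread run with the shell's time command should show nearly identical wall-clock times, i.e. close to 2x aggregate throughput, since neither thread competes for execution resources while waiting on memory.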
 

MS_AT

Senior member
Jul 15, 2024
365
798
96
I am wondering. I would have expected parallel compilation to benefit a lot from SMT.
Scaling issues. It is why servethehome.com started splitting compilation workloads into, let's say, 4x 64 workers to better show the throughput achieved by newer chips. Without doing further measurements it's hard to say what exactly went wrong. This single result only tells us that compilation with 128 workers does better than with 256 workers.
 

naukkis

Senior member
Jun 5, 2002
962
829
136
Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the machine, and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units rather than the FP execution.

I should have said that longer instruction latencies make it harder to extract full hardware performance. Of course it's still possible to write a well-optimized FPU algorithm that bottlenecks some part of the hardware so thoroughly that SMT could only decrease performance. But because FP instruction latencies are generally longer than integer ones, SMT scaling is usually better on FP than on pure integer workloads.
 

DrMrLordX

Lifer
Apr 27, 2000
22,184
11,890
136
Zen 6 10+% IPC comes from the same slide stating Zen 5 10-15+%. Translated from marketing speak: Zen 6 10-14%, Zen 5 10-19%.

The current shape of Zen 5 matches this range. So the 10-14% uptick of Zen 6 comes from the current shape of Zen 5.
That doesn't address circumstances where disabling SMT on Zen5 produces performance gains (in gaming and elsewhere). You're also citing pretty wide performance ranges, so there's a big difference between Zen6 being +10% over Zen5's +10% versus Zen6 being +14% over Zen5's +19%.
 
Reactions: Tlh97

Rheingold

Member
Aug 17, 2022
69
203
76
On average, Intel was 10-15% and AMD was 25-30%.
Here's another data point using the SPEC suite: +26.3% for Zen 4, +18.82% for Golden Cove. That's more in line with what I'd expect. SMT works better on AMD, but the uplift on average isn't twice as high or more; it's more like 40% better. Of course this can always be skewed in one direction or the other by carefully choosing the tested applications.

The main takeaway for me is that Intel removing HT still doesn't make sense, because the performance uplift should still outweigh the die area it requires. Perhaps it will make sense if the smaller E-cores completely replace the P-cores in the future as their performance deficit shrinks with every generation, or if those Rentable Units that were speculated about still materialize at some point.
 

OneEng2

Senior member
Sep 19, 2022
259
358
106
Nah, I don't think we have sufficient data to say Zen 5 is universally faster in ST workloads or that it universally does better at MT.

Where? And was it made clear this will be a consumer-facing die? I mean, we already have 16-core CCDs in Turin D.

Sometimes design changes are done to enable a clock increase.

It's doing better than Ampere, which is using custom ARM cores.
Actually, it is currently only speculation that Arrow Lake will be slower in ST workloads than either the 14900K or the 9950X; however, unless all the information leaked to date is wrong, I don't expect the reviews to show favorable ST performance for Arrow Lake. We will soon see.

I believe the Zen 6 CCD speculation is that it will come in 8-, 16-, and 32-core flavors. My guess is that the 8- and 16-core variants will be Zen 6, and that the 32-core CCD will be Zen 6c. This would provide a very diverse set of processors. You could have 8p and 32c for 40 cores and 80 threads, with a "high end" processor having 16p and 32c, or 16p and 16p.
In the article I wrote, it was an average. Can't find it now. On average, Intel was 10-15% and AMD was 25-30%.
I either read your article or read several others, but that is my recollection of the average as well.
Here's another data point using the SPEC suite: +26.3% for Zen 4, +18.82% for Golden Cove. That's more in line with what I'd expect. SMT works better on AMD, but the uplift on average isn't twice as high or more; it's more like 40% better. Of course this can always be skewed in one direction or the other by carefully choosing the tested applications.

The main takeaway for me is that Intel removing HT still doesn't make sense, because the performance uplift should still outweigh the die area it requires. Perhaps it will make sense if the smaller E-cores completely replace the P-cores in the future as their performance deficit shrinks with every generation, or if those Rentable Units that were speculated about still materialize at some point.
.... and that is my main takeaway as well. SMT provides a very good return on die-space investment for MT loads. It also does so at a minimal cost in thermal output per unit of performance gained.

I believe that P-cores will be around, as will E-cores... but I am not so sure it makes as much sense to do it the way Intel has, with two totally different architectures. Maintaining two completely different designs is likely much less cost- and time-efficient than taking a processor like a full Zen 5, swapping its transistor types for lower clock speed and better power efficiency, and removing some of the cache. Not that I want to belittle the work needed to do that, but it is likely much, much less effort than a completely different CPU design.

SMT really is a very cool idea. It simply allows the multiple parallel paths within a single core to be utilized more often, for a minimum of additional logic. I really don't understand where Intel's head is on this one.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the machine, and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units rather than the FP execution.
Matrix math is one workload that doesn't benefit much from SMT, if at all, because it can saturate the FP units.

Cinebench is not a compute-intensive FP workload. It is rather balanced, with a mix of both scalar integer and FP instructions. So there's good headroom for SMT to scale, plus it's very parallel, so it takes nice advantage of extra threads.
But because FP instruction latencies are generally longer than integer ones, SMT scaling is usually better on FP than on pure integer workloads.
What workloads make FP benefit more from HT than integer? FP is the one that's easier to parallelize (it has roots in a coprocessor from the pre-486 days).

You gain very little from SMT in HPC workloads for the same reason.
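
A minimal sketch of that saturation (a toy microkernel of my own, assuming AVX2+FMA; kernel_8x8 and its layout are hypothetical, not taken from any real BLAS): a GEMM inner kernel keeps eight independent accumulator chains in flight, enough to cover the FMA latency across both FMA ports, so the FP units are already near-saturated from one thread and an SMT sibling mostly just splits the caches with it.

// gemm_kernel.c -- toy 8x8 GEMM microkernel (build: gcc -O2 -mavx2 -mfma -c)
#include <immintrin.h>

// C[8x8] += A[8xK] * B[Kx8]; C and B row-major with stride 8, A with stride lda.
void kernel_8x8(const float *A, const float *B, float *C, int K, int lda) {
    __m256 acc[8];
    for (int i = 0; i < 8; i++)
        acc[i] = _mm256_loadu_ps(&C[i * 8]);           // load the C tile
    for (int k = 0; k < K; k++) {
        __m256 b = _mm256_loadu_ps(&B[k * 8]);         // one row of B
        for (int i = 0; i < 8; i++) {
            __m256 a = _mm256_set1_ps(A[i * lda + k]); // broadcast one A element
            acc[i] = _mm256_fmadd_ps(a, b, acc[i]);    // 8 independent FMA chains
        }
    }
    for (int i = 0; i < 8; i++)
        _mm256_storeu_ps(&C[i * 8], acc[i]);           // write back the C tile
}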
 
Reactions: Tlh97

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
One of the reasons for AMD's better SMT performance compared to Intel may be its split integer and FP schedulers and execution units: when one thread is executing integer code, another thread is free to execute FPU code, and vice versa. Pure speculation on my part; I have no data to back it up.
This is plausible. The way they handle shared/split resources with SMT on vs. off likely has something to do with it too.

This is for Nehalem; I don't know about modern CPUs. Maybe we should compare them?

For original Ryzen
 

Saylick

Diamond Member
Sep 10, 2012
3,645
8,223
136
Matrix math is one workload that doesn't benefit much from SMT, if at all, because it can saturate the FP units.
Considering that my workload only uses SSE2 and runs on Skylake-based servers (Xeon Gold 6146, specifically), which have dual FMA units but only two load units and one store unit, I am not sure it's the FP units that are saturated. Idk, it just seems like the FP execution engine is entirely capable of handling two threads. Fwiw, when I saturate both HT threads with this workload, the performance per thread gets cut in half, so there's no net gain in throughput.
 

naukkis

Senior member
Jun 5, 2002
962
829
136
What workloads make FP benefit more from HT than integer? FP is the one that's easier to parallelize (it has roots in a coprocessor from the pre-486 days).

You gain very little from SMT in HPC workloads for the same reason.

I don't know how data being fixed-point or floating-point would change anything about parallelization. But FP instruction latencies are generally longer than their integer equivalents, which potentially creates bubbles in the instruction stream that SMT can fill. HPC-style code with unlimited parallelism that can exploit the hardware's full FLOPS is a special case, not the general one.
 

JustViewing

Senior member
Aug 17, 2022
225
408
106
But because FP instruction latencies are generally longer than integer ones, SMT scaling is usually better on FP than on pure integer workloads.
FP latencies are not that long; most operations can be performed in a single cycle. Even a mul operation can be performed in 3 cycles. If the FP we are talking about is vector FP, then most likely the execution units will be fully saturated. Most of the time, vector execution is performed on a relatively large dataset, which has enough operations on the available data to fully saturate the FPU execution units. However, if the FPU is just used for small scalar calculations, then it can benefit from SMT.

Even when executing heavy vector code, the integer side of the CPU might be able to squeeze in another SMT thread for an additional 5-10% performance.
 
Reactions: Tlh97 and MS_AT

naukkis

Senior member
Jun 5, 2002
962
829
136
FP latencies are not that long; most operations can be performed in a single cycle. Even a mul operation can be performed in 3 cycles. If the FP we are talking about is vector FP, then most likely the execution units will be fully saturated. Most of the time, vector execution is performed on a relatively large dataset, which has enough operations on the available data to fully saturate the FPU execution units. However, if the FPU is just used for small scalar calculations, then it can benefit from SMT.

FP throughput for most operations can be 1 per cycle, but FP latencies surely aren't mostly single cycle. The simplest operation, add, is indeed single-cycle latency for integer, but for xmm it's 5-cycle latency on Zen 1 and 3 cycles on Haswell and most newer designs. Mul and div have even longer latencies. Instruction latencies matter when your instructions are dependent on each other and the next instruction can only issue after the first one's result arrives. With FP, even doing dependent adds leaves two bubbles in the execution pipeline, available for further code optimization or for SMT to issue the same instructions from a different thread.
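
To make those bubbles concrete, a minimal sketch (my own example, assuming roughly 3-cycle FP add latency and two add pipes): a naive reduction is one serial dependency chain, so of the ~6 add slots available every 3 cycles only 1 is used, and that slack is exactly what unrolling, or an SMT sibling, can fill.

// Latency-bound reduction: every add waits on the previous result.
float sum_naive(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];   // serial chain: roughly one add retires per 3 cycles
    return s;
}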
 

MS_AT

Senior member
Jul 15, 2024
365
798
96
FP throughput for most operations can be 1 per cycle, but FP latencies surely aren't mostly single cycle. The simplest operation, add, is indeed single-cycle latency for integer, but for xmm it's 5-cycle latency on Zen 1 and 3 cycles on Haswell and most newer designs. Mul and div have even longer latencies. Instruction latencies matter when your instructions are dependent on each other and the next instruction can only issue after the first one's result arrives. With FP, even doing dependent adds leaves two bubbles in the execution pipeline, available for further code optimization or for SMT to issue the same instructions from a different thread.
Latency is 2-3 cycles for FP add, 3 for FP mul, and 4 for FMA, with 2 adds and 2 FMA/mul per cycle. Zen 5 is capable of sustaining that from a single thread. With AVX-512 there should be enough architectural registers that the compiler will be able to hide the latencies.
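
A sketch of what that latency hiding looks like in source (my own example; six accumulators assumes ~3-cycle add latency times two pipes, and the same idea extends to AVX-512's 32 zmm registers): with enough independent chains the bubbles from the previous post's example disappear, which is also why SMT then has little left to contribute. Note that reassociating the sum this way can change the floating-point result slightly versus the naive loop.

// Throughput-bound reduction: six independent accumulator chains.
float sum_unrolled(const float *x, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0;
    int i = 0;
    for (; i + 6 <= n; i += 6) {
        s0 += x[i];     s1 += x[i + 1];  // independent chains fill
        s2 += x[i + 2]; s3 += x[i + 3];  // the latency bubbles
        s4 += x[i + 4]; s5 += x[i + 5];
    }
    float s = s0 + s1 + s2 + s3 + s4 + s5;
    for (; i < n; i++)
        s += x[i];                       // remainder loop
    return s;
}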
 

naukkis

Senior member
Jun 5, 2002
962
829
136
Latency is 2-3 cycles for FP add, 3 for FP mul, and 4 for FMA, with 2 adds and 2 FMA/mul per cycle. Zen 5 is capable of sustaining that from a single thread. With AVX-512 there should be enough architectural registers that the compiler will be able to hide the latencies.

My point was that FP instructions have longer latencies than integer ones. That's pretty obvious, because a fixed-point calculation is a single operation and an FP one isn't: the mantissa and exponent calculations have to be serialized because they are related and cannot be executed completely simultaneously (align exponents, add mantissas, then normalize and round). That makes it really hard to do even the simplest operations, like add, with single-cycle latency.
 

Philste

Senior member
Oct 13, 2023
259
454
96
Yeah, Intel looks bad there.
Not really, just a different approach.

AMD is 26% more performance at 27% more power.

Intel is 18% more performance at 3% more power.

Overall it's just different. I would even argue that Intel's approach is better, because it's basically free performance at the same power. Which makes it even more unbelievable that they removed it for new consumer CPUs.
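
Working the perf/W arithmetic on those figures (my own derivation, taking the quoted numbers at face value):

AMD: 1.26x perf / 1.27x power ≈ 0.99x perf/W, so efficiency is roughly unchanged with SMT on.
Intel: 1.18x perf / 1.03x power ≈ 1.15x perf/W, so about 15% better efficiency with HT on.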
 

Abwx

Lifer
Apr 2, 2011
11,612
4,469
136
Not really, just a different approach.

AMD is 26% more performance at 27% more power.

Intel is 18% more performance at 3% more power.

Overall it's just different. I would even argue that Intel's approach is better, because it's basically free performance at the same power. Which makes it even more unbelievable that they removed it for new consumer CPUs.

There must be a mistake in those numbers, because 18% better throughput implies 18% more power drawn by the execution units, registers, and LSUs, as well as the adders, multipliers, and so on.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
Considering that my workload only uses SSE2 and runs on Skylake-based servers (Xeon Gold 6146, specifically), which have dual FMA units but only two load units and one store unit, I am not sure it's the FP units that are saturated.
Skylake has very fat load/store units: two load units at 32B/cycle for client and two at 64B/cycle for server, double what Haswell has. That fits the 2x512-bit AVX-512 units perfectly.

It goes back to @naukkis's point about Cinebench being a heavy FP workload, when it's not.

The same workload that scales well with SMT is also the workload that doesn't gain a lot from AVX-512.
 