In the article I wrote it was an average. Can't find it now. On average, Intel was 10-15% and AMD was 25-30%.
What do you mean by "never"? Software that's well suited for SMT yields much more than that on Intel systems. I couldn't find new benchmarks, but an 8700K had several uplifts of 30% and more here. The SMT uplift on Ryzen/Epyc is higher, but Intel's was not just 10%. Or is this just about Xeons, and can you provide some data for this?
Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the computer and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units instead of the FP execution.
That was my point - 2-way SMT uplift is exactly 100% when the workload is 100% memory access latency bound. That is a purely mathematical fact: two independent threads scale perfectly when they don't hamper each other's execution at all. Compute-intensive FP workloads scale pretty well with SMT because instruction latencies are many clock cycles, but nowhere near 100%. And with those kinds of FP workloads you see similar SMT gains on Intel hardware too.
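To make the latency-bound case concrete, here is a rough toy sketch of that argument (my own example, not from the thread): a pointer chase is one long chain of dependent cache misses, so a second copy running on the sibling SMT thread can overlap its misses with the first and come close to doubling combined throughput. Pinning both threads to one physical core (e.g. with taskset) is assumed but left out for brevity.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <utility>
#include <vector>

// Walk a permutation cycle: every step is a load that depends on the previous one.
static std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t i = 0;
    for (std::size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}

int main() {
    const std::size_t n = std::size_t{1} << 24;   // ~128 MiB of indices, far beyond the LLC
    const std::size_t steps = std::size_t{1} << 25;

    // Sattolo's algorithm: the permutation is one big cycle, so the walk
    // can't settle into a short, cache-resident loop.
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }

    auto t0 = std::chrono::steady_clock::now();
    std::size_t r1 = chase(next, steps);               // one thread
    auto t1 = std::chrono::steady_clock::now();

    std::size_t r2 = 0, r3 = 0;
    std::thread a([&] { r2 = chase(next, steps); });   // two threads, same work each
    std::thread b([&] { r3 = chase(next, steps); });
    a.join();
    b.join();
    auto t2 = std::chrono::steady_clock::now();

    std::printf("1 thread: %.2fs   2 threads: %.2fs   (checksum %zu)\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count(),
                r1 + r2 + r3);
}
```

If the two threads land on the two SMT siblings of one core, the second run does twice the total work in roughly the same wall time, which is the "exactly 100% uplift when purely latency bound" case.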
Scaling issues. It is why servethehome.com started splitting compilation workloads into, let's say, 4x 64 workers to better show the throughput achieved by newer chips. Without doing further measurements it's hard to say what exactly went wrong. This single result only tells us that compilation with 128 workers does better than with 256 workers.
I am wondering. I would have expected parallel compilation to benefit a lot from SMT.
Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the computer and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units instead of the FP execution.
That doesn't address circumstances where disabling SMT on Zen5 produces performance gains (in gaming and elsewhere). You're also citing pretty wide performance ranges, so there's a big difference between Zen6 being +10% over Zen5's +10% versus Zen6 being +14% over Zen5's +19%.
Zen 6 10+% IPC comes from the same slide stating Zen 5 10-15+%. Translated from marketing speak: Zen 6 10-14%, Zen 5 10-19%.
The current shape of Zen 5 matches this range, so the 10-14% uptick of Zen 6 comes on top of Zen 5 in its current shape.
Here's another data point using the SPEC suite: +26.3% for Zen 4, +18.82% for Golden Cove. That's more in line with what I'd expect. SMT works better on AMD, but the uplift on average isn't twice as high or even more, more like 40% better. Of course this can always be skewed in one direction or the other by carefully choosing the tested applications.
On average, Intel was 10-15% and AMD was 25-30%.
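For what it's worth, the "more like 40% better" reading falls straight out of the two SPEC uplifts in the post above:

$$\frac{26.3\%}{18.82\%} \approx 1.40$$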
Actually, it is currently only speculation that Arrow Lake will be slower in ST workloads than either the 14900K or the 9950X; however, unless all the information Intel has leaked to date is wrong, I don't expect the reviews to show favorable ST performance for Arrow Lake. We will soon see.
Nah, I don't think we have sufficient data to say Zen5 is universally faster in ST workloads or that it does universally better at MT.
Where? And was it made clear this will be a consumer-facing die? I mean, we already have 16-core CCDs in Turin D.
Sometimes design changes are made to enable clock increases.
It's doing better than Ampere, which is using custom ARM cores.
I either read your article, or read several others, but that is my recollection of the average as well.
In the article I wrote it was an average. Can't find it now. On average, Intel was 10-15% and AMD was 25-30%.
.... and that is my main takeaway as well. SMT provides a very good return on die space investment for MT loads. It also does so at a minimal price in thermal output per unit of performance gained.
Here's another data point using the SPEC suite: +26.3% for Zen 4, +18.82% for Golden Cove. That's more in line with what I'd expect. SMT works better on AMD, but the uplift on average isn't twice as high or even more, more like 40% better. Of course this can always be skewed in one direction or the other by carefully choosing the tested applications.
The main takeaway for me is that Intel removing HT still doesn't make sense, because the performance uplift should still outweigh the die area it requires. Perhaps it will make sense if the smaller E-cores completely replace the P-cores in the future as their performance deficit shrinks with every generation, or if those Rentable Units that were speculated about still materialize at some point.
Matrix math is one workload that doesn't benefit much from SMT, if at all, because it can saturate the FP units.
Interesting you say this, because the workload I use for work sees significant performance degradation when we have more threads than physical cores on the computer and it is structural analysis software, which should involve a ton of matrix math (GEMM). I suspect the bottleneck may be the load-store units instead of the FP execution.
What workloads make FP benefit more from HT than integer does? FP is the one that's easier to parallelize (it goes back to being a separate accelerator in the pre-486 days).
But because FP instruction latencies are generally longer than integer ones, SMT scaling is usually better on FP than on pure integer workloads.
This is plausible. The way they handle shared/split resources with SMT on vs. off likely has to do with it too.
One of the reasons for AMD's better SMT performance compared to Intel may be its split integer and FP schedulers and execution units. When one thread is executing integer code, another thread is free to execute FP code and vice versa. Pure speculation on my part; I have no data to back it up.
Considering that my workload only uses SSE2 and runs on Skylake-based servers (Xeon Gold 6146, specifically), which have dual FMA units but only two load units and one store unit, I am not sure it's the FP units which are saturated. Idk, it just seems like the FP execution engine is entirely capable of handling two threads. Fwiw, when I saturate both HT threads with this workload, the performance per thread gets cut in half, so there's no net gain in throughput.
Matrix math is one workload that doesn't benefit much from SMT, if at all, because it can saturate the FP units.
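As an illustration of why that workload could be load-store bound rather than FP bound (my own back-of-the-envelope, not from the thread): a dot-product/GEMM-style inner loop with no register blocking needs two loads per multiply-add, so two load ports cap it at roughly one multiply-add pair per cycle while the FP pipes could retire two, and a second SMT thread competing for the same load ports then adds nothing.

```cpp
// Hypothetical SSE2-era kernel shape, annotated with the port arithmetic.
#include <cstddef>

double dot(const double* a, const double* b, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Per element: 2 loads, 1 mul, 1 add.
        // Assuming the compiler vectorizes/unrolls enough to hide the add
        // latency, 2 load ports deliver ~2 loads/cycle, i.e. ~1 mul+add pair
        // per cycle -- the FP units sit half idle while the load ports are full.
        // A second SMT thread needs the same load ports, so total throughput
        // stays put and per-thread speed roughly halves.
        acc += a[i] * b[i];
    }
    return acc;
}
```

A properly register-blocked GEMM reuses operands and cuts the loads per multiply-add, which is why tuned BLAS kernels end up FP-bound instead; whether the structural-analysis code blocks like that is anyone's guess.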
What workloads make FP benefit more from HT than integer does? FP is the one that's easier to parallelize (it goes back to being a separate accelerator in the pre-486 days).
You gain very little from SMT in HPC workloads for the same reason.
FP latencies are not that long; most operations can be performed in a single cycle. Even a mul can be done in 3 cycles. If the FP we are talking about is vector FP, then most likely the execution units will be fully saturated. Most of the time, vector execution works on a relatively large dataset, with enough operations on the available data to fully saturate the FP execution units. However, if the FPU is just used for small scalar calculations, then it can benefit from SMT.
But because FP instruction latencies are generally longer than integer ones, SMT scaling is usually better on FP than on pure integer workloads.
FP latencies are not that long; most operations can be performed in a single cycle. Even a mul can be done in 3 cycles. If the FP we are talking about is vector FP, then most likely the execution units will be fully saturated. Most of the time, vector execution works on a relatively large dataset, with enough operations on the available data to fully saturate the FP execution units. However, if the FPU is just used for small scalar calculations, then it can benefit from SMT.
Latency is 2-3 cycles for FP add, 3 for FP mul and 4 for FMA, with 2 adds and 2 FMA/mul per cycle. Zen 5 is capable of sustaining that from a single thread. With AVX-512 there should be enough architectural registers that the compiler will be able to hide the latencies.
FP throughput for most operations can be 1 per cycle, but FP latencies surely aren't mostly single cycle. The simplest operation, add, is indeed single-cycle latency for integer, but on xmm registers it's 5 cycles on Zen 1 and 3 cycles on Haswell and most newer designs. Mul and div have even longer latencies. Instruction latencies matter when your instructions depend on each other and the next one can only issue once the first one's result is available. With FP, even doing dependent adds leaves two bubbles in the execution pipeline, which are then available for further code optimization or for SMT to fill with the same instructions from a different thread.
Latency is 2-3 cycles for FP add, 3 for FP mul and 4 for FMA, with 2 adds and 2 FMA/mul per cycle. Zen 5 is capable of sustaining that from a single thread. With AVX-512 there should be enough architectural registers that the compiler will be able to hide the latencies.
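A minimal sketch of the latency-vs-throughput point being argued above (my example, not from the thread): summing through one accumulator serializes on the FP add latency and leaves exactly those pipeline bubbles, while unrolling into a few independent accumulators, or letting the sibling SMT thread issue its own adds, fills them.

```cpp
#include <cstddef>

// One dependent chain: each add must wait for the previous result,
// so throughput is limited to one add per add-latency (~3 cycles).
double sum_chained(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Four independent chains: the adds interleave in the pipeline and can
// approach the core's add throughput from a single thread -- the same
// bubbles SMT would otherwise fill with a second thread's instructions.
double sum_unrolled(const double* x, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that without -ffast-math the compiler is not allowed to reassociate the single-accumulator version, so the dependent chain really is what executes.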
In the article I wrote it was an average. Can't find it now. On average, Intel was 10-15% and AMD was 25-30%.
Allegedly, that 9950X is an outlier.
Not really, just a different approach.
Yeah, Intel looks bad there.
Hmm...could be caused by a lot of mispredictions.
AMD is 26% more performance at 27% more power.
Not really, just a different approach.
AMD is 26% more performance at 27% more power.
Intel is 18% more performance at 3% more power.
Overall it's just different. I would even argue that Intel's approach is better, because it's basically free performance at the same power, which makes it even more unbelievable that they removed it for the new consumer CPUs.
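Putting the two figures quoted above in performance-per-watt terms makes the "free performance" point explicit:

$$\text{AMD: } \frac{1.26}{1.27} \approx 0.99 \qquad \text{Intel: } \frac{1.18}{1.03} \approx 1.15$$

so in that test AMD's SMT uplift roughly holds perf/W flat, while Intel's improves it by about 15%.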
Skylake has very fat load/store units: two load units at 32 B/cycle each on client and 64 B/cycle each on server, the latter double what Haswell has. That fits the 2x512-bit AVX-512 units perfectly.
Considering that my workload only uses SSE2 and runs on Skylake-based servers (Xeon Gold 6146, specifically), which have dual FMA units but only two load units and one store unit, I am not sure it's the FP units which are saturated.
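Spelling out the "fits perfectly" arithmetic (my framing of the numbers above):

$$2 \times 64\,\mathrm{B/cycle} = 128\,\mathrm{B/cycle} = 2 \times 512\,\mathrm{bits/cycle}$$

i.e. enough load bandwidth to feed each of the two 512-bit FMA ports one fresh memory operand per cycle, with the remaining operands held in registers as a blocked kernel would do.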
That is the way I read it... and Arrow Lake is on a newer-generation process node as well. That is the most distressing part for me.
So the 9950X is faster with 30 watts less power?