It's weird, because other tests (e.g. the Phoronix EPYC tests) show that enabling SMT has little to no impact on power draw but generally increases performance.
In reply to: "Not really, just a different approach.
AMD is 26% more performance at 27% more power.
Intel is 18% more performance at 3% more power.
Overall it's just different. I would even argue that Intel's approach is better, because it's basically free performance at the same power. Which makes it even more unbelievable that they removed it for new consumer CPUs."
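To put the two sets of numbers on the same footing, the change in performance per watt can be worked out directly. A minimal sketch using only the percentages quoted in the post; the function name is just illustrative:

```python
def perf_per_watt_change(perf_gain_pct, power_gain_pct):
    """Relative change in performance per watt (in percent) with SMT enabled."""
    perf = 1 + perf_gain_pct / 100
    power = 1 + power_gain_pct / 100
    return (perf / power - 1) * 100


amd = perf_per_watt_change(26, 27)    # AMD figures quoted in the post
intel = perf_per_watt_change(18, 3)   # Intel figures quoted in the post

print(f"AMD:   {amd:+.1f}% perf/W")
print(f"Intel: {intel:+.1f}% perf/W")
```

With these inputs AMD's efficiency comes out roughly flat (about -0.8%) while Intel's improves by about 14.6%, which is the "free performance" point being made.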
Skylake has very fat Load/Store units. It's two Load units with 32B/cycle for client and two 64B/cycle for server, which is double the ones in Haswell. That fits 2x512 units for AVX512 perfectly.
A hardware-limited workload is either compute bound or load/store bound. SMT won't add load/store capacity (except on AMD's Bulldozer-family cores), but compute-bound code will benefit from SMT, especially FP code, because long instruction latencies usually make full utilization of the execution hardware very hard to achieve.
In reply to: "It goes back to @naukkis mentioning Cinebench being a heavy FP workload, when it's not.
The same workload that scales well with SMT is also the workload that doesn't gain a lot from AVX512."
Right, the program I run doesn’t take advantage of the wider vectors afforded by AVX2, and definitely not AVX512, which is why I think I’m L/S bound.
In reply to: "But it's still limited to 2x128-bit loads for SSE2 code, like most x86 designs. Only Skymont will be able to do three 128-bit loads. Those 256-bit and 512-bit loads won't help at all with SSE2."
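To see why wider load ports don't help SSE2 code, it's enough to multiply loads per cycle by the vector width. A rough sketch, assuming the two SIMD loads per cycle mentioned in this thread and the architectural vector widths (16, 32, and 64 bytes); it is peak arithmetic only and ignores caches, stores, and everything else that limits real code:

```python
# Peak load bandwidth per core, per cycle, as a function of vector width.
LOADS_PER_CYCLE = 2  # assumption taken from the thread: two SIMD loads per cycle

VECTOR_BYTES = {"SSE2": 16, "AVX2": 32, "AVX-512": 64}  # 128, 256, 512 bits

peak_bytes = {isa: LOADS_PER_CYCLE * width for isa, width in VECTOR_BYTES.items()}

for isa, per_cycle in peak_bytes.items():
    share = per_cycle / peak_bytes["AVX-512"]
    print(f"{isa:8s} {per_cycle:4d} B/cycle ({share:.0%} of the 512-bit peak)")
```

So pure SSE2 code can pull at most 32 B/cycle through load ports that could move 128 B/cycle with 512-bit loads.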
Only 2 pipes are able to do SIMD loads/stores. In general you can do at most 4 mem ops, of which at most 2 can be stores. You can do at most 2 SIMD loads per cycle or 2 SIMD stores (only 1 for AVX512).
In reply to: "Right, the program I run doesn’t take advantage of the wider vectors afforded by AVX2, and definitely not AVX512, which is why I think I’m L/S bound."
I’m really interested in Zen 5, since it appears to handle a mix of 4 loads/2 stores per cycle, which to me is clear evidence that it can truly handle two threads at full throughput. Again, more evidence that Zen 5 was designed for SMT from the get-go. My company sources our servers through Dell, so it shouldn’t be long before Turin is available for purchase. I would love to get my hands on a dual-socket R7725 blade with dual EPYC 9655. I’m in charge of one of our internal analysis committees, so I have a little bit of sway on our server hardware choices.
View attachment 109739
Unfortunately, it’s old commercial software. We don’t have control over how it’s compiled.
In reply to: "Only 2 pipes are able to do SIMD loads/stores. In general you can do at most 4 mem ops, of which at most 2 can be stores. You can do at most 2 SIMD loads per cycle or 2 SIMD stores (only 1 for AVX512).
Also, if your code is really pure (the pure part is important here) SSE2 only, then it doesn't use the FMA units at all. If it is in your power, I would really consider recompiling it for something newer. You can use 128b vectors with AVX2/512; the compiler will actually emit FMA for these, and it will have 16/32 (AVX512) architectural registers to play with instead of 8."
My approach would be to convince management to let me get a bunch of different machines (could be anything, laptops or desktops, as long as they are different brands or generations of CPUs) and then benchmark the software to see which particular CPU/RAM combination delivers the best performance. Right now, I would ask them to get me one 9700X and one 265K and then get to work benchmarking them, rather than trying to extrapolate from the results of OTHER, wildly different benchmarks to estimate how our software would perform. The test machines could always be repurposed or given to hard-working employees.
In reply to: "Unfortunately, it’s old commercial software. We don’t have control over how it’s compiled."
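Whichever machines end up being available, the comparison itself can be kept simple: run the same job several times on each box and compare medians. A minimal sketch, assuming the software can be launched from a command line; the command below is only a stand-in workload and `time_command` is a made-up helper name:

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=5):
    """Run cmd `repeats` times and return the wall-clock times in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times


# Placeholder workload: replace with the real application and a real input.
cmd = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
runs = time_command(cmd)
print(f"median {statistics.median(runs):.3f} s, "
      f"min {min(runs):.3f} s over {len(runs)} runs")
```

Running it once with SMT enabled and once with it disabled in the BIOS, on the same machine and input, would answer the SMT question for this specific workload rather than for someone else's benchmark.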
Unless you are an Intel shareholder, there's nothing to be distressed about. Intel dug their own hole and they are the ones who have to dig themselves out of it. Some diehard fans will still buy Intel out of loyalty or disdain for AMD, but generally, well-informed people will buy whatever gives them the best bang for the buck. Intel is still, miraculously, not as badly off as they could have been. Can you imagine what would've happened if Jim Keller hadn't hammered the idea of buying TSMC wafer capacity into their thick skulls? They would still be churning out 400W consumer CPUs with massive performance and cooling issues.
In reply to: "That is the most distressing part for me."
Lol, I wish I could do this. Best case is I convince them to let me rent cloud compute instances to evaluate various options.
In reply to: "My approach would be to convince management to let me get a bunch of different machines (could be anything, laptops or desktops, as long as they are different brands or generations of CPUs) and then benchmark the software to see which particular CPU/RAM combination delivers the best performance. Right now, I would ask them to get me one 9700X and one 265K and then get to work benchmarking them, rather than trying to extrapolate from the results of OTHER, wildly different benchmarks to estimate how our software would perform. The test machines could always be repurposed or given to hard-working employees."
Yes. Do that!
In reply to: "Best case is I convince them to let me rent cloud compute instances to evaluate various options."
Not a shareholder, just an American ex-Navy Nuke Submariner. I very much believe that the US cannot be secure without US-based chip IP companies leading the market. I would very much like Intel to start acting like a tech company leader.
In reply to: "Unless you are an Intel shareholder, there's nothing to be distressed about. Intel dug their own hole and they are the ones who have to dig themselves out of it. Some diehard fans will still buy Intel out of loyalty or disdain for AMD, but generally, well-informed people will buy whatever gives them the best bang for the buck. Intel is still, miraculously, not as badly off as they could have been. Can you imagine what would've happened if Jim Keller hadn't hammered the idea of buying TSMC wafer capacity into their thick skulls? They would still be churning out 400W consumer CPUs with massive performance and cooling issues."
Intel is using superior silicon on Arrow Lake compared to their own Intel fab. That is where the power efficiency for Intel comes from.
In reply to: "Not really, just a different approach.
AMD is 26% more performance at 27% more power.
Intel is 18% more performance at 3% more power.
Overall it's just different. I would even argue that Intel's approach is better, because it's basically free performance at the same power. Which makes it even more unbelievable that they removed it for new consumer CPUs."
Arrow Lake isn't in the chain of discussion. It's an SMT efficiency comparison, so it's Raptor Lake S.
In reply to: "Intel is using superior silicon on Arrow Lake compared to their own Intel fab. That is where the power efficiency for Intel comes from."
In reply to: "because it's basically free performance at same power. Which makes it even more unbelievable that they removed it for new consumer CPUs."
Performance per power is only one of the reasons for implementing SMT. It's nice to get it without increasing power usage much, but even with a linear power increase it would be desirable, because that's better than the less-than-linear performance-per-power scaling at high frequencies. The other argument for SMT is more performance per area, because the additional transistors require much less area (around 5%) than they add in performance (15-30%). The source is this document, and the money quotes are:
This implementation of Hyper-Threading Technology added less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that.
Measured performance on the Intel Xeon processor MP with Hyper-Threading Technology shows performance gains of up to 30% on common server application benchmarks for this technology.
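The area argument can be put in numbers as well. A small sketch using the figures from the post (15-30% more performance for about 5% more area); since the Intel document says "less than 5%", treating it as exactly 5% is a conservative assumption:

```python
AREA_GAIN_PCT = 5  # upper bound quoted from the Intel document


def perf_per_area_change(perf_gain_pct, area_gain_pct=AREA_GAIN_PCT):
    """Relative change in performance per unit of die area, in percent."""
    return ((1 + perf_gain_pct / 100) / (1 + area_gain_pct / 100) - 1) * 100


for perf_gain in (15, 30):
    change = perf_per_area_change(perf_gain)
    print(f"+{perf_gain}% performance -> {change:+.1f}% performance per area")
```

That works out to roughly +9.5% to +23.8% performance per area, so on these numbers SMT is a net win for a design that is optimized for area.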
The main takeaway for me is that Intel removing HT still doesn't make sense, because the performance uplift should still be higher than the die area required.
Why do I say this? Because Intel says "Optimized for PPA" on their slide for the Arrow Lake P-cores:
Maybe the simplest answer is correct: removing SMT simplifies validation of the design, and makes it easier to schedule threads between the P cores and the E cores. That’s it. Anything beyond that is just marketing so that the consumer doesn’t feel like they got a downgrade.
In reply to: "Performance per power is only one of the reasons for implementing SMT. It's nice to get it without increasing power usage much, but even with a linear power increase it would be desirable, because that's better than the less-than-linear performance-per-power scaling at high frequencies. The other argument for SMT is more performance per area, because the additional transistors require much less area (around 5%) than they add in performance (15-30%). The source is this document, and the money quotes are:
So...
Why do I say this? Because Intel says 'Optimized for PPA' on their slide for the Arrow Lake P-cores:
View attachment 109751
This. Doesn't. Make. Sense.
And sorry for this now basically being a post for the Intel thread, but we are also discussing AMD's more effective implementation of the same technology."
Isn't it crazy that we're resorting to "taking shortcuts" conclusions like this for the juggernaut that was "Big Blue", while the agile "small" contender AMD doesn't have to do this?
In reply to: "removing SMT simplifies validation of the design"
AMD never fired their entire silicon validation team.
In reply to: "while the agile 'small' contender AMD doesn't have to do this?"
Can't. Smart. Smart not possible. Please be dumb.
In reply to: "My approach would be to convince management to let me get a bunch of different machines (could be anything, laptops or desktops, as long as they are different brands or generations of CPUs) and then benchmark the software to see which particular CPU/RAM combination delivers the best performance. Right now, I would ask them to get me one 9700X and one 265K and then get to work benchmarking them, rather than trying to extrapolate from the results of OTHER, wildly different benchmarks to estimate how our software would perform. The test machines could always be repurposed or given to hard-working employees."
How many people are well informed? Not many. Most others will follow.
In reply to: "Unless you are an Intel shareholder, there's nothing to be distressed about. Intel dug their own hole and they are the ones who have to dig themselves out of it. Some diehard fans will still buy Intel out of loyalty or disdain for AMD, but generally, well-informed people will buy whatever gives them the best bang for the buck. Intel is still, miraculously, not as badly off as they could have been. Can you imagine what would've happened if Jim Keller hadn't hammered the idea of buying TSMC wafer capacity into their thick skulls? They would still be churning out 400W consumer CPUs with massive performance and cooling issues."
Only 2 pipes are able to do SIMD loads/stores. In general you can do at most 4 mem ops, of which at most 2 can be stores. You can do at most 2 SIMD loads per cycle or 2 SIMD stores (only 1 for AVX512).
Also, if your code is really pure (the pure part is important here) SSE2 only, then it doesn't use the FMA units at all. If it is in your power, I would really consider recompiling it for something newer. You can use 128b vectors with AVX2/512; the compiler will actually emit FMA for these, and it will have 16/32 (AVX512) architectural registers to play with instead of 8.
He mentioned GEMM, and GEMM will use FMA if available, so it could use it. Since SSE2 predates x64, it's hard to say if his software was compiled for x86 or x64.
In reply to: "You suppose that his code uses FMA? SSE2 in x64 mode also has 16 registers; the limit of 8 applies only in 32-bit mode.
Intel actually offers great performance monitoring programs, so a developer doesn't need to guess where the code is bottlenecked."
I would disagree here. I think the main reasons are "lots of design decisions made around improving SMT performance", making AMD CPUs great scalable multicore processors, and not wanting to spam their CPUs with smaller cores (I'm not entirely sure of the reasons, but I think they have Mont-type cores under development that are not yet ready for prime time).
In reply to: "Main reason AMD hasn't yet is that it's easy performance for them to tout and still no one really cares about security."
That's a really high base freq compared to the rest of the Zen 5 lineup 🙃