Skylake X and AVX512: (July 6, 2017)
Let's talk about Skylake X and AVX512. Because everyone's been waiting for this. Since there's currently a lack of AVX512 benchmarks and stress tests. And because of that, I've had at least half a dozen people and organizations contact me about y-cruncher's AVX512.
Okay...
some AVX512 benchmarks already existed.
SiSoftware Sandra had some support. And my little-known
FLOPs benchmark did too. But people either weren't aware of them, or wanted more. And by advertising y-cruncher's internal AVX512 support for at least a year now, I basically brought this on myself.
So let's get to the point. Unfortunately, AVX512 will not bring the "instant massive performance gain" that a lot of people were expecting. Realistically speaking, the speedups over AVX2 seem to vary around 10 - 50% - usually on the lower end of that scale. While the investigation is on-going, there are some known factors:
- Not all Skylake X and Skylake Purley processors will have the full AVX512 capability.
- "Phantom throttling" of performance when certain thermal limits are exceeded.
- Memory bandwidth is a significant bottleneck.
- Amdahl's law and other unknown scalability issues.
Not all Skylake X and Skylake Purley processors will have the full AVX512 capability:
While this reason doesn't apply to my system, it's worth mentioning it anyway.
Architecturally, Skylake X retains Skylake desktop's architecture with 2 x 256-bit FMA units. In Skylake X, those two 256-bit FMA units can merge to form a single 512-bit FMA. On the processors with full-throughput AVX512, there is also a dedicated 512-bit FMA - thereby providing 2 x 512-bit FMA capability.
However, that dedicated 512-bit FMA is only enabled on the Core i9 parts. The 6-core and 8-core Core i7 parts are supposed to have it disabled. Therefore they only have half the AVX512 performance.
It's worth mentioning that there is a
benchmark on an engineering-sample 6-core Core i7 that shows full-throughput AVX512 anyway. However, engineering sample processors are not always representative of the retail parts.
So as of this writing, I still don't know if the 6 and 8-core Skylake X Core i7's have the full AVX512. The only Skylake X processor I have at this time is the Core i9 7900X which is supposed to have the full AVX512 anyway. (and indeed it does based on my tests)
"Phantom throttling" of performance when certain thermal limits are exceeded:
Within minutes of getting my system setup, I started noticing inconsistencies in performance. And after spending a long Friday night investigating the issue, I determined that there was a sort of "Phantom throttling" of AVX512 code when certain thermal limits are exceeded.
"Phantom throttling" is the term that I used to describe the problem in my emails with the
Silicon Lottery vendor. And it looks like I'm not the only one using that term anymore. Phantom throttling is when the processor gets throttled without a change in clock frequency. For many years, processors have throttled down for many reasons to protect it from damage. But when throttling happens, it has always been done by lowering the clock frequency - which is visible in a monitor like CPUz. Skylake X is the first line of processors to break from this and it makes it more difficult to detect the throttling.
Right now, the phantom throttling phenomenon is still not well understood.
Overclocker der8auer has mentioned that it could be caused by CPUz not reacting fast enough to actual clock frequency changes. On the other hand, the tests that Silicon Lottery and myself have done seem to show the that there really is no drop in clock frequency at all.
Initially, I observed this effect only with AVX512 code and thus hypothesized that the mechanism behind the throttling is the shutdown of the dedicated 512-bit FMA. But others have found that phantom throttling also occurs on AVX and scalar code as well. In short, much more investigation is needed. The lack of AVX512 programs out there certainly doesn't help and is partially why I'm rushing this release of y-cruncher v0.7.3.
Currently, there are no known reliable ways of stopping the throttling and results vary heavily by motherboard manufacturer. But maxing out thermal limits and disabling all thermal protections seems to help. (Don't try this at home if you don't know what you're doing or you aren't at least moderately experienced in overclocking. You can destroy your processor and/or motherboard if you aren't careful.)
Memory bandwidth is a significant bottleneck:
y-cruncher was already slightly memory-bound on Haswell-E. Now on Skylake X, it is much worse. While I had anticpiated a memory bottleneck on Skylake X with AVX512, it seems that I've underestimated the severity of it:
(The CPU frequencies in this benchmark were chosen to be low enough to avoid any throttling or phantom throttling.)
1 billion digits of Pi - Core i9 7900X @ 3.8 GHz
Times in Seconds
Threads Memory Frequency Instruction Set
AVX2 AVX512
1 thread
2133 MHz 444.434 325.543
3200 MHz 438.432 319.737
20 threads
2133 MHz 51.884 45.658
3200 MHz 47.672 39.723
In the single threaded benchmarks, the memory frequency has less than 2% effect for both AVX2 and AVX512. Multi-threaded, that jumps to 9% and 15% respectively. This is much more than is expected for a program that used to be completely compute-bound just a few years ago.
Amdahl's law and other unknown scalability issues:
In a typical y-cruncher computation, only about 80% of the CPU time is spent running vectorized code when AVX2 is used. So by Amdahl's law, even if we get perfect scaling with the AVX512, we can only cut 40% off the run-time. Right now, the single-threaded benchmarks (which are least memory-bound) are only showing 27% speedup with AVX512 over AVX2.
This remaining 13% discrepancy is currently unresolved. Microbenchmarks of y-cruncher's AVX512 code show near perfect 2x speedups over AVX2. (Some show >2x thanks to the increased register count.) But this speedup seems to drop off as the data sizes increase - even while still fitting in cache. This seems to hint at unknown bottlenecks within the L2 and L3 caches. The fact that cache sizes haven't increased along with wider the SIMD also doesn't help.
For now, investigation is difficult because none of my performance profilers support Skylake X yet.
Implications for Stress-Testing:
y-cruncher's failure to achieve a decent speedup for AVX512 also means that it is unable to put a heavy load on the AVX512 computation units. Therefore it is not a great stress-test for Skylake X with full AVX512.
But there is one y-cruncher feature which seems to be unaffected - the BBP benchmark.
The BBP benchmark feature is contained entirely in cache is thus free of the memory bottleneck. It is able to put a much higher stress than the stress-tester and the computations. So if you run the BBP benchmark (option 4) and set the offset to 100 billion, you can still put a pretty heavy load on your AVX512-capable processor.
A future version of y-cruncher will revamp the stress-tester to incorporate the BBP benchmark as well as other possible improvements.