For maximum throughput, the FMA units need to be utilized as much as possible. To achieve this,
- Avoid overusing the caches, as discussed. If the active data cannot be cached completely, the memory controller(s) and RAM are unlikely to provide the high throughput and low latency required to keep the FMA execution units fed.
- Use a sufficiently high number of threads in total across all tasks, such that all FMA execution units are engaged.
- Use as few threads per task as possible (while keeping the above in mind). More threads per task mean a larger share of time spent on inter-thread synchronization rather than on actual computation.
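
The three rules above can be turned into a rough starting point for testing. The following is only a minimal sketch under simplifying assumptions: `plan_tasks` and its parameters are hypothetical names, the FFT data size per task is taken as a given input, and one L3 cache domain (for example one CCX) is treated in isolation. It only suggests a configuration to try first; actual PPD still has to be measured.

```python
def plan_tasks(cores_per_domain: int, l3_mb_per_domain: float, fft_mb: float):
    """Suggest (tasks, threads_per_task, fits_in_cache) for one L3 cache domain.

    Hypothetical helper: a back-of-the-envelope model, not a measurement.
    """
    # Rule 1: how many simultaneous tasks keep all FFT data inside L3?
    max_tasks_by_cache = int(l3_mb_per_domain // fft_mb)
    fits_in_cache = max_tasks_by_cache >= 1

    # Never more tasks than cores; at least one task even if it spills to RAM.
    tasks = max(1, min(cores_per_domain, max_tasks_by_cache))

    # Rule 2: keep every core busy, so the task count must divide the core count.
    while cores_per_domain % tasks != 0:
        tasks -= 1

    # Rule 3: with the most tasks the cache allows, threads per task is minimal.
    threads_per_task = cores_per_domain // tasks
    return tasks, threads_per_task, fits_in_cache


if __name__ == "__main__":
    # Assumed example: 4 cores sharing 16 MB L3, three different FFT data sizes.
    for fft_mb in (1.9, 6.0, 20.0):
        print(fft_mb, plan_tasks(4, 16.0, fft_mb))
```

In this model, small FFT data sizes lead to one single-threaded task per core, while larger sizes force fewer tasks with more threads each so that the combined data still fits into the cache.
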
For Zen2, @biodoc's results from 2019 with <2 MB FFT data size may still be representative of what works best with <4 MB FFT data size, since Zen2 has 4 MB L3$ per core and 1 FMA3 unit per core (or: 1 AVX2 pipeline per core). @biodoc measured PPD and PPD/W back then:
"Has anybody tested the optimal thread count (in terms of PPD) for this FFT length yet?"