Argh, this bugs me so much (it's important to get the semantics right, dammit!)
It's not "AVX workloads", it's 256-bit ops, any 256-bit ops. Both AVX and AVX2 can target 128-bit vectors if they want, for example to use the extra instructions AVX/AVX2 add over SSE, or because console code is tuned for 3-operand 128-bit AVX, etc. (First thing that bugs me, out of the way.)
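To make the encoding point concrete, here's a minimal sketch (my own illustration, function names hypothetical; build with something like gcc -mfma): the same 128-bit math written SSE-style as two instructions, and as the single 3-operand form the VEX encoding enables (strictly the FMA extension, which shipped alongside AVX2). The second version is a "new era" instruction that never touches a 256-bit register:

```c
#include <immintrin.h>

/* SSE-style thinking: separate multiply and add, two instructions */
__m128 muladd_sse(__m128 a, __m128 x, __m128 y)
{
    return _mm_add_ps(_mm_mul_ps(a, x), y);
}

/* VEX-encoded 3-operand form: a single vfmadd on xmm registers,
 * i.e. a post-SSE instruction operating purely on 128-bit vectors. */
__m128 muladd_vex(__m128 a, __m128 x, __m128 y)
{
    return _mm_fmadd_ps(a, x, y);
}
```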
Also, it's not really the width of the units that's the limiting factor for Zen, because it has more FP units than Skylake. It's the load/store bandwidth in and out of the cores: Zen can load 256 bits and store 128 bits per cycle, vs 512/256 for Haswell and later.
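Back-of-envelope illustration (my sketch): one 256-bit SAXPY step needs two 256-bit loads and one 256-bit store per FMA. That's exactly one cycle of Haswell/Skylake load/store bandwidth, but two cycles of Zen's, so the 256-bit op stalls on the memory pipes while the FP units sit idle:

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] += a * x[i], 8 floats at a time. Per iteration: 2 x 256-bit
 * loads + 1 x 256-bit store feeding a single 256-bit FMA, so the
 * load/store ports, not the FPU, set the pace. */
void saxpy256(float a, const float *x, float *y, size_t n)
{
    __m256 va = _mm256_set1_ps(a);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vy = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i),
                                        _mm256_loadu_ps(y + i));
        _mm256_storeu_ps(y + i, vy);
    }
}
```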
The point being: AMD's AVX and AVX2 performance is fine; there isn't some magical thing making AMD crap at those instruction sets (BD and PD had real 256-bit instruction issues). It's that Intel has an advantage on anything that is a 256-bit operation. At the same time, AMD/Zen has an advantage on 128-bit operations because it has more units.
If I was AMD I wouldn't go chasing 256-bit or 512-bit AVX performance, or SMT4*; I would be using the massive die and power budget those things cost to increase clocks and IPC. If you look at AMD GPUs (or NV's), they are becoming much better at being CPU-like. The more GPU compute capacity becomes flexible and general, the more a 512-bit CPU becomes a jack of all trades, master of none. If you have a master at both, you can just eat them from both sides.
The general server base doesn't care about really wide vectors, and thanks to Intel's own segmentation, neither does the consumer market.
*Those rumors from Fottemberg aren't worth the bits in the database they are stored on, unless AMD plans on basically copying a POWER9-style methodology of core design, which is really a more unified version of CMT.
Zen, with its 4x128 FPU pipeline, NOT SHARED with INT resources, can do (in 256-bit FP terms):
- 1 FMUL + 1 FADD, or 1 FMAC, per cycle
Skylake, with its 2x256 pipelines:
- 1 FMUL + 1 FADD, or 2 FMACs, per cycle
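Those counts cash out like this (my toy kernel, and only a sketch: real throughput also depends on dependency chains and compiler scheduling). With separate multiplies and adds, the two designs are at rough parity per the numbers above. An FMA-dense kernel with little memory traffic, like a polynomial evaluated per element, is where Skylake's second FMA pipe pays off:

```c
#include <immintrin.h>
#include <stddef.h>

/* Degree-7 polynomial per element (Horner form): 7 FMAs per single
 * 256-bit load, so memory traffic is not the limit. With enough
 * independent iterations in flight, Skylake can retire two 256-bit
 * FMAs per cycle to Zen's one-256-bit-equivalent. */
void poly7(const float *x, float *y, size_t n, const float c[8])
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 r  = _mm256_set1_ps(c[7]);
        for (int k = 6; k >= 0; --k)              /* r = r*x + c[k] */
            r = _mm256_fmadd_ps(r, vx, _mm256_set1_ps(c[k]));
        _mm256_storeu_ps(y + i, r);
    }
}
```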
If not using the FMACs (not all algorithms allow that), Zen can only lose on simple calculations that have lots of loads and stores and that are cache-friendly.
If there is an FDIV, an FSQRT, or more than 3-4 instructions per load or store, the bottleneck becomes the other units and not the load/store.
If the code is not cache-friendly (a stream of data to be added or multiplied), then the bottleneck becomes the RAM.
Even if the code is full of FMACs, if either of the two conditions above holds, the limiting factors are the same.
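Two toy loops to make that concrete (again my own sketch; neither one is limited by 256-bit FMA throughput on either chip):

```c
#include <immintrin.h>
#include <stddef.h>

/* Cache-unfriendly case: a pure stream over buffers far bigger than
 * the caches, one add per three 256-bit memory ops. DRAM bandwidth
 * sets the speed on Zen and Skylake alike. */
void stream_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(c + i,
            _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
}

/* Division/sqrt-heavy case: vsqrtps and vdivps are long-latency,
 * partially pipelined ops on a single port on both designs, so the
 * divider unit, not load/store width or FMA count, is the wall. */
void rsqrt_full(const float *x, float *y, size_t n)
{
    __m256 one = _mm256_set1_ps(1.0f);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        _mm256_storeu_ps(y + i, _mm256_div_ps(one, _mm256_sqrt_ps(v)));
    }
}
```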
Think of Blender. Do you think that to do raytracing you don't need any division, sqrt, or complicated calculation (more than 3-4 instructions) for each piece of data fetched from memory?
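For a feel of the arithmetic density, here's a bare-bones ray-sphere hit test (my own sketch, not Blender's actual code; the ray direction is assumed pre-normalized, which itself costs a sqrt and a divide). It's a dozen-plus FP ops plus an FSQRT for every sphere record pulled from memory, well past the 3-4-instructions-per-load mark:

```c
#include <math.h>

/* Distance t to the first hit, or -1 on a miss.
 * ro = ray origin, rd = unit ray direction, c = sphere center. */
float ray_sphere(const float ro[3], const float rd[3],
                 const float c[3], float r)
{
    float oc[3] = { ro[0] - c[0], ro[1] - c[1], ro[2] - c[2] };
    float b  = oc[0]*rd[0] + oc[1]*rd[1] + oc[2]*rd[2]; /* dot(oc, rd) */
    float cc = oc[0]*oc[0] + oc[1]*oc[1] + oc[2]*oc[2] - r*r;
    float disc = b*b - cc;              /* quadratic discriminant */
    if (disc < 0.0f) return -1.0f;      /* ray misses the sphere */
    float t = -b - sqrtf(disc);         /* nearest root: an FSQRT */
    return t >= 0.0f ? t : -1.0f;
}
```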
Only simple BLAS (linear algebra) routines will see big gains from the SKL architecture...