Ok, as "documented by Intel", let's check the Intel 64 and IA-32 Architectures Optimization Reference Manual:
http://www.intel.com/content/www/us...-ia-32-architectures-optimization-manual.html
If you don't want to follow that link and see for yourself, here are two relevant screenshots:
As you can see, Port0 and Port1 both support VEC FMA, VEC MUL and VEC ADD. To be even more specific, Intel includes a small table
that specifies how many units can execute some selected instructions.
Also, InstLatX64 has a small table comparing Haswell, Broadwell and Skylake:
http://users.atw.hu/instlatx64/HSWvsBDWvsSKL.txt
I quote the relevant section:
And lastly, since you prefer Hardware.fr, here are their measurements of Skylake:
http://www.hardware.fr/marc/skl.txt
Yep, it's a bit tedious to check EVERY supported instruction, so let's go to the four relevant ones:
To explain it a bit:
VADDPS = vector ADD packed Single-precision
VMULPD = vector MUL packed Double-precision
YMM = full 256-bit AVX register
L: 4c = Latency: 4 clocks
T: 0.5c = Throughput: 0.5 clocks per instruction (reciprocal throughput), i.e. 2 instructions per clock.
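If the arithmetic behind those two numbers isn't obvious, here is a rough sketch of it in Python. It only uses the figures from the legend above (L: 4c, T: 0.5c, 256-bit YMM); the function names are made up for illustration and don't come from Intel or any of the linked tables.

```python
import math


def instructions_per_clock(reciprocal_throughput_clocks: float) -> float:
    """Convert 'T: x clocks per instruction' into instructions per clock."""
    return 1.0 / reciprocal_throughput_clocks


def independent_ops_to_saturate(latency_clocks: float,
                                reciprocal_throughput_clocks: float) -> int:
    """How many independent instructions must be in flight to keep the
    units busy: latency divided by reciprocal throughput, rounded up."""
    return math.ceil(latency_clocks / reciprocal_throughput_clocks)


def elements_per_clock(register_bits: int, element_bits: int,
                       reciprocal_throughput_clocks: float) -> float:
    """Vector elements processed per clock for one instruction type."""
    lanes = register_bits // element_bits
    return lanes * instructions_per_clock(reciprocal_throughput_clocks)


if __name__ == "__main__":
    latency = 4.0      # L: 4c
    recip_tput = 0.5   # T: 0.5c

    print(instructions_per_clock(recip_tput))                # 2.0
    print(independent_ops_to_saturate(latency, recip_tput))  # 8
    print(elements_per_clock(256, 32, recip_tput))           # 16.0 (packed single)
    print(elements_per_clock(256, 64, recip_tput))           # 8.0 (packed double)
```

So T: 0.5c only makes sense if two units can execute the instruction, and with L: 4c you need about 8 independent instructions in flight to actually reach that rate.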
Byes