IPC is commonly agreed on to mean "single threaded performance per clock."
I don't subscribe to it, as we already discussed here around a month ago.
6 uops/cycle from the uop cache is enough for two heavy-duty 3 IPC threads... And 10 units are enough for the peak and to keep a steady execution flux of 6 uops/cycle (on optimized code and without cache misses, obviously).
And according to the Blender results, Broadwell(-E) resources guarantee almost exactly the same result.
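To make the "6 uops/cycle is enough for two 3 IPC threads" arithmetic explicit, here is a minimal sketch. It assumes, as a simplification, that each instruction maps to exactly one uop unless you say otherwise; the function names and the `uops_per_instruction` parameter are made up for illustration and are not taken from any vendor documentation.

```python
def required_uop_rate(ipc_per_thread, threads=2, uops_per_instruction=1.0):
    """Sustained uops/cycle the front end has to deliver.

    Assumption (hypothetical): every thread runs at the same IPC and the
    same average number of uops per instruction.
    """
    return ipc_per_thread * threads * uops_per_instruction


def front_end_sufficient(supply_uops_per_cycle, ipc_per_thread,
                         threads=2, uops_per_instruction=1.0):
    """True if the uop supply covers the demand of all threads."""
    demand = required_uop_rate(ipc_per_thread, threads, uops_per_instruction)
    return supply_uops_per_cycle >= demand


# Two SMT threads at 3 IPC each, one uop per instruction (assumed).
print(required_uop_rate(3))
# Same supply, but code averaging more than one uop per instruction.
print(front_end_sufficient(6, 3, uops_per_instruction=1.2))
```

With one uop per instruction the demand is exactly 6 uops/cycle, so there is no headroom: any code that averages more than one uop per instruction would already exceed the supply in this simple model.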
http://www.agner.org/optimize/blog/read.php?i=415
According to Agner, it's 4 fused uops, without fusion limits, so it is correct that the limit is 8 uops, but it's rare for all of them to be fused ops...
Sure, but Intel's scheduler has a better chance to feed its ports, since it has an exact "vision" of the platform at any given moment.
Yes, but we know how dense the AMD MOPs are, up to 3 uops (or an RMW instruction), and from the uop cache, if the hit rate is high enough, we obtain 6 uops/cycle.
Intel's uops can also carry a lot of "actions/operations". That's why you see similar results in Blender. And that's why it has shown much better ST performance.
I'm pretty curious to see how well Zen can perform with heavy ST code, like emulators, compilers, databases, etc.
At 256 bits it is correct that you have the same (FMUL or FADD) or double (FMAC) the throughput of Zen, because it can do 2x256-bit FMAC and 2x256-bit VECINT, but in that case there are no int operations because all 4 ports are occupied, while Zen can do 4 int ops in parallel. This is what I said, without specifying bits. But it can do 1 FMUL + 1 FADD or 1 FMAC + 1 FADD or 1 FMAC + 1 FMUL or FMAC, and for legacy x87 or 128-bit code Zen throughput is equal or superior: 2 FMUL + 2 FADD or 1 FMAC + 1 FMUL + 1 FADD or 2 FMAC.
FMAC is emulated by FMUL + FADD in Zen, as you know. That's why it can't reach Intel performance in this case.
Intel also has two symmetrical FPU units that can each do any of FMAC/FMUL/FADD. So it can do 2 256-bit FMACs or FMULs or FADDs, whereas Zen is limited to 1 of those (per type).
It can also do 2 VALU or VShift (on the same port), and there's another one for VALU or VShuffle.
On Zen we don't know what the situation is, but if it follows a design similar to Jaguar's, you only have 2x128-bit = 1x256-bit VShift or "conversion", and 4x128-bit = 2x256-bit VALU; I don't know about the shuffle operations.
And you are wrong about the integer units: even with all 3 Int/FPU ports busy, there's always one which is free for "integer" (ALU, Shift, Branch) operations.
Plus 2x256-bit load, 1x256-bit store, and one AGU unit. Whereas Zen has only 2 ports here (the 2 AGUs) where you can submit memory operations.
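The FP throughput figures argued about above can be turned into peak FLOPs/cycle with a bit of arithmetic. This is only a sketch: the unit counts are the ones claimed in this thread (not measured or confirmed), single-precision lanes are assumed, an FMAC is counted as 2 FLOPs per lane, and the helper names are invented for the example.

```python
def lanes(vector_bits, element_bits=32):
    """Number of elements in one vector (single precision assumed by default)."""
    return vector_bits // element_bits


def peak_flops_per_cycle(vector_bits, fmac=0, fmul=0, fadd=0, element_bits=32):
    """Peak FLOPs/cycle for a given mix of operations issued per cycle.

    An FMAC counts as 2 FLOPs per lane, FMUL and FADD as 1 each.
    """
    return lanes(vector_bits, element_bits) * (2 * fmac + fmul + fadd)


# Unit counts as claimed in the thread (assumptions, not measurements).
intel_256_fmac = peak_flops_per_cycle(256, fmac=2)            # 2x256-bit FMAC
zen_256_fmac = peak_flops_per_cycle(256, fmac=1)              # 1x256-bit FMAC
zen_256_mul_add = peak_flops_per_cycle(256, fmul=1, fadd=1)   # 1 FMUL + 1 FADD
zen_128_fmac = peak_flops_per_cycle(128, fmac=2)              # 2x128-bit FMAC
zen_128_mul_add = peak_flops_per_cycle(128, fmul=2, fadd=2)   # 2 FMUL + 2 FADD

print(intel_256_fmac, zen_256_fmac, zen_256_mul_add)
print(zen_128_fmac, zen_128_mul_add)
```

Under these assumptions the 2x256-bit FMAC case comes out at twice Zen's 256-bit FMAC peak, while a pure FMUL + FADD mix on Zen matches its own FMAC peak, which is the "same or double" point made earlier.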
I don't think that it's an outlier... Surely it's the best case, but looking at the decoders, uop cache, cache bandwidth, number of pipelines, I don't think its IPC will be so low...
I haven't said that it's low, but I think that it's likely that it'll be lower than Intel counterparts.
IF (and only if) AMD's statement is correct, 40% more IPC compared to XV can give an estimate.
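As a rough sketch of how that estimate works: take XV's IPC as the baseline and scale it by the claimed 40%. The baseline of 1.0 is just a normalization and the competitor value in the example is a purely hypothetical placeholder, not a measured figure.

```python
def projected_ipc(baseline_ipc, uplift=0.40):
    """Scale a baseline IPC by a claimed relative uplift (0.40 = +40%)."""
    return baseline_ipc * (1.0 + uplift)


def relative_gap(projected, competitor_ipc):
    """How far the projection sits from a competitor, as a ratio (1.0 = parity)."""
    return projected / competitor_ipc


xv_ipc = 1.0                        # XV normalized to 1.0 (assumption)
zen_estimate = projected_ipc(xv_ipc)
hypothetical_competitor_ipc = 1.6   # placeholder value, NOT a real measurement

print(zen_estimate)
print(relative_gap(zen_estimate, hypothetical_competitor_ipc))
```

The result is only as good as the two inputs: if the 40% claim or the competitor figure is off, the gap moves proportionally.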
Regarding SMT... At 128 bits, the higher number of ports certainly helps to reach a higher MT IPC...
We have already seen Blender results...