It was my impression that Zen would feature a greater leap forward in floating point execution resources than int. Did I misinterpret the latest leaks?
In most ways Zen's FPU is wider than Haswell's (except for fmul, where it is apparently at a 50% disadvantage). Unfortunately, we don't really know how well this compares with the internal workings of the FlexFPU in Excavator, because that sits behind another scheduler.
From the gcc patches (I've been examining them very carefully), it would appear that Zen's advantages are larger in floating point than in integer. In fact, it looks like it would, in theory, lay the smack-down on Haswell in both areas, but there is more to performance than these considerations:
INTEGER
Zen Integer advantages over Haswell:
Double the shifts and rotates (4 vs 2)
Double the LEA instruction throughput (4 vs 2)
1/4 Division Port Usage (Haswell locks 4 ports for division)*
Zen Integer same as Haswell:
2x branch
1x indirect branch
4x mov, movx, add, cmp, etc.
1x mul, imul, imulx
Zen Integer disadvantage vs Haswell:
4x indirect branch pipeline usage vs Haswell's*
* These are ILP issues that prevent other instructions from executing/being scheduled on ANY ALU at the same time.
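To put the integer list above into numbers, here is a rough sketch in Python. The per-cycle counts are copied straight from the list (so they reflect the gcc scheduling models, not measurements). The "free ALUs during a divide" figure assumes 4 integer ALUs on both chips, which I'm inferring from the "4x mov, add, cmp" line. The names `INT_PER_CYCLE`, `DIV_LOCKED_PORTS`, `summarize` and `free_alus_during_div` are made up for illustration.

```python
# Per-cycle issue counts (Zen, Haswell) as listed above.
# These come from the gcc scheduling descriptions, not from hardware tests.
INT_PER_CYCLE = {
    "shift/rotate": (4, 2),
    "lea": (4, 2),
    "branch": (2, 2),
    "indirect branch": (1, 1),
    "mov/add/cmp": (4, 4),
    "mul": (1, 1),
}

# Assumption: both cores expose 4 integer ALUs (inferred from "4x mov, add, cmp").
TOTAL_ALUS = 4

# Ports tied up by one division in the gcc model: Haswell 4, Zen 1/4 of that.
DIV_LOCKED_PORTS = {"zen": 1, "haswell": 4}


def summarize(table):
    """Return the Zen/Haswell throughput ratio for each instruction class."""
    return {op: zen / hsw for op, (zen, hsw) in table.items()}


def free_alus_during_div(core):
    """ALUs still available to other instructions while a divide is issued."""
    return TOTAL_ALUS - DIV_LOCKED_PORTS[core]


if __name__ == "__main__":
    for op, ratio in summarize(INT_PER_CYCLE).items():
        print(f"{op:16s} Zen/Haswell = {ratio:.2f}x")
    for core in DIV_LOCKED_PORTS:
        print(f"{core:8s} free ALUs during div: {free_alus_during_div(core)}")
```

This only compares peak issue width per instruction class. It says nothing about latencies, the front end, or the caches, which is where the real answer will be decided.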
FLOATING POINT
Zen FPU advantages over Haswell:
33% more FPU pipelines
2x fdiv
50% more mmx_add and sse_add
2x mmx_cvt
33% more sse_logic
Averages nearly twice as wide (much better ILP)
Zen FPU Identical with Haswell:
fcmp
fop
fsgn
mmxshift
Zen FPU Disadvantages vs Haswell:
1/2 ssemuladd (FMA?)
- Zen pairs with FP3 for every ssemuladd
On the whole, if the instruction latencies and the cache system were up to par, we'd expect Zen to beat Haswell per clock in most cases. However, Haswell has 50% more AGUs and a dedicated store data unit, which is ganged with one of the three AGUs for stores to double the potential bandwidth (AFAICT).
I see some low-hanging fruit in the design that could give Zen+ its 15% extra performance, but it's hypothetical (the core should be able to handle it; I'm betting the front or back ends are not up to the task).
Please note, all of my information comes from the gcc source code, and I made assignment spreadsheets for Zen and Haswell:
http://looncraz.net/ZenAssignments.html
http://looncraz.net/HswAssignments.htm
Regardless, nobody actually runs single-threaded code on an XV module except for odd cases such as SuperPi or when deliberately running a multi-threaded benchmark in single-threaded mode for . . . whatever reason.
Most of the time, the extra multi-threaded performance is all you need. Games and many browser benchmarks, however, do not scale well, if at all, so higher IPC is certainly more desirable.
Anyway, The Stilt's testing showed XV to be 5% faster than SR in Cinebench R10 at the same clockspeed with the same number of modules.
I saw a 35W Excavator do 9.85% better. I think The Stilt's numbers are TDP limited, so they aren't valuable for direct comparison to higher-power Steamroller parts.
A 4c/8t Zen would wind up being slower - much slower - than 4m/8t XV. Yuck.
Not gonna happen. I don't think AMD could make Zen slower than Excavator if they tried. To be honest, I'm trying to figure out how they are only claiming 40% higher IPC, though I base my math exclusively on that 40%, which puts Zen in Haswell territory on average, with an FPU deficit.
I'm thinking the 40% IPC is integer only, and the FPU is closer to 60~80%. If the caches can keep pace, then Zen will be a great alternative to Intel's current lineup, per clock. Of course, we have a year to wait and we have no idea where the clocks will fall.
If Zen hits 3.5GHz and overclocks to 4GHz, their SMT performance will need to be impressive or they will need to add extra cores.
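For what it's worth, the IPC and clock figures above combine roughly like this. It's a back-of-the-envelope sketch that assumes Excavator is the IPC baseline (1.0) and that performance scales linearly with clock, which real chips never quite manage. The 60% and 80% entries are just my FPU guess from above, not a claim from AMD, and `clock_normalized_perf` and `SCENARIOS` are names invented for the example.

```python
def clock_normalized_perf(ipc_gain, clock_ghz):
    """Per-thread throughput in 'Excavator-equivalent GHz'.

    ipc_gain is the fractional IPC uplift over Excavator (0.40 means +40%).
    Assumes perfect scaling with clock speed, which is optimistic.
    """
    return (1.0 + ipc_gain) * clock_ghz


# IPC uplift scenarios taken from the discussion above.
SCENARIOS = {
    "int, claimed +40%": 0.40,
    "fp, guessed +60%": 0.60,
    "fp, guessed +80%": 0.80,
}

if __name__ == "__main__":
    for label, gain in SCENARIOS.items():
        for clock in (3.5, 4.0):
            perf = clock_normalized_perf(gain, clock)
            print(f"{label:18s} @ {clock:.1f} GHz -> {perf:.2f} XV-equivalent GHz")
```

Read the output as "an Excavator core would need to run at this clock to keep up", nothing more precise than that.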
Up until examining the pipeline assignments I was certain AMD would have an inferior SMT design, but I seriously think they have the potential to match, or even exceed, Hyperthreading.