Wow, this is interesting. So what is the estimate of Zen's performance?
If it gets to Haswell levels on ST, I'll go with AMD, not for the performance, but to keep a competitor alive (just like the people who watched the Warcraft movie).
None of us has any real clue about performance (some people's guesses might turn out right, but it's not much more than flipping a coin). My approach has been to look for upper/lower bounds, things that could limit ST performance.
Right now there isn't really anything we can find that says Zen can't have ST perf just as high as Broadwell. Skylake and beyond have bigger internal structures, so they should get more ILP, all other things being equal.
What we have no idea about is the prefetchers/predictors, but this is one area where CON/CAT was actually pretty good. I would expect an evolution of these units from those cores, so they are likely behind Intel in that regard, but the gap is probably not going to be massive. In some cases Intel has actually turned down the aggressiveness of its prefetch/predict because it could gain more from the power saved (and thus higher clocks) than the perf lost to the extra cache misses.
There are a few "interesting points" that we can see with Zen for ST:
Do the 6 small integer schedulers have an impact on extractable ILP vs one big one, and if so, how much?
What's the mispredict penalty now, with both a "checkpoint unit" that AMD said almost nothing about and the uop cache?
With SMT, the interesting one for me is the hard-partitioned store queue. I've often wondered how late (aggressive) Intel is in terms of writing data out to cache. There is a lot of power to be saved if you can store-to-load forward the data compared to writing it to cache and then reading it back again.
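To make the store queue point a bit more concrete, here is a toy Python model of store-to-load forwarding. It is only a sketch: the `StoreQueue` class, the `run` helper and the entry counts (8 vs 4) are made up for illustration and are not Zen's or Intel's real numbers. It also assumes a store is only written to cache when the queue is full, which is about as "late" as a design could possibly be.

```python
from collections import deque


class StoreQueue:
    """Toy per-thread store queue (hypothetical model, not a real design)."""

    def __init__(self, entries):
        self.entries = entries      # assumed capacity for one thread
        self.pending = deque()      # (address, value) pairs, oldest first
        self.cache = {}             # stands in for the L1 data cache
        self.cache_writes = 0
        self.forwarded_loads = 0
        self.cache_loads = 0

    def store(self, addr, value):
        # Assumption: the oldest store is written out only when we run out of room.
        if len(self.pending) == self.entries:
            old_addr, old_value = self.pending.popleft()
            self.cache[old_addr] = old_value
            self.cache_writes += 1
        self.pending.append((addr, value))

    def load(self, addr):
        # The youngest matching store in the queue supplies the data (forwarding).
        for pending_addr, pending_value in reversed(self.pending):
            if pending_addr == addr:
                self.forwarded_loads += 1
                return pending_value
        # No match in the queue, so the load has to read the cache.
        self.cache_loads += 1
        return self.cache.get(addr)


def run(entries, n_addresses=6):
    """Store to n_addresses locations, then read them all back."""
    sq = StoreQueue(entries)
    for addr in range(n_addresses):
        sq.store(addr, addr * 10)
    for addr in range(n_addresses):
        assert sq.load(addr) == addr * 10
    return sq.forwarded_loads, sq.cache_loads, sq.cache_writes


for entries in (8, 4):
    print(entries, "entries ->", run(entries))
```

With these made-up sizes, the 8-entry queue forwards all six loads and never touches the cache, while the 4-entry queue (what one thread would see if the same queue were split in half) forwards four loads and has to write two stores out early and read them back.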
A lot more of the questions are actually around SIMD performance. This whole stack engine discussion is really about having enough load/store and address generation for very heavy, sustained amounts of 3-operand operations (AVX, FMA).
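As a back-of-envelope illustration of why address generation matters for those 3-operand ops, here is a small Python sketch. The `max_ops_per_cycle` helper is hypothetical, and it assumes the worst case where nothing stays in registers and any AGU can serve any load or store, which real cores do not guarantee (ports are often specialised, and load/store bandwidth has its own limits).

```python
from fractions import Fraction


def max_ops_per_cycle(loads_per_op, stores_per_op, agus):
    """Upper bound on ops per cycle if every memory access needs one AGU slot.

    Simplifying assumption: every AGU can handle either a load or a store.
    """
    mem_ops = loads_per_op + stores_per_op
    if mem_ops == 0:
        return None  # pure register work, not limited by address generation
    return Fraction(agus, mem_ops)


# Hypothetical streaming kernel d[i] = a[i] * b[i] + c[i]:
# each FMA needs 3 loads and 1 store when all operands come from memory.
for agus in (2, 3):
    print(agus, "AGUs ->", max_ops_per_cycle(3, 1, agus), "FMA per cycle at most")
```

Under those assumptions, 2 AGUs cap this kernel at 1/2 FMA per cycle and 3 AGUs at 3/4, regardless of how many FP pipes are available. Real code keeps operands in registers, so it sits well above this bound, but it shows the kind of limit being discussed.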
On top of this, a Zen core "only" has an IVB amount of load and store bandwidth in and out of the core, and only 128-bit wide FP units. For SSE and x64/x86 code, the ratio of 2 AGUs to 4 ALUs should almost never be an issue; Intel only added a 3rd AGU when it went to 256-bit units and data paths, so that's a hint as well.
Now, per clock, Bulldozer/Piledriver were roughly a match for Intel (module vs core) in 128-bit SIMD ops; in 256-bit ops Bulldozer/Piledriver had some penalties that lowered their performance. So when we look at Zen, it has quite a large number of changes to FP SIMD handling, which should help boost its performance above CON core level and hopefully above IVB for 256-bit ops. For 128-bit ops Zen could very well be ahead of all Intel chips; we will have to wait and see.
The improvements we know Zen has over CON cores are:
Larger FP register file
More execution resources
Lower execution latency (except for FMA)
Lower load-to-use latency
A buffer ahead of the FPU scheduler, so ops can still be sent to the FPU even if the FPU scheduler is full
What we don't know is whether Zen has improved 256-bit handling, but it's a reasonably safe assumption that it has, as these were internal FPU issues, not core-wide things.
Now, clocks are anyone's guess: there is Fmax for the core on the process it's on, Fmax for the core on an ideal process (not saying LPP isn't an "ideal" process), and clock/power scaling.
Personally, there have only been two people whose word I would value in terms of Zen performance: one is Neilz at B3D, who works for an OEM, and the other is Thevenin, who also works for an OEM. Both have said quite positive things about Zen without being specific (Neilz was commenting on the 32-core part; Thevenin appeared to be commenting on the 8-core part).