No they wouldn't. Their TCO is entirely too high even taking into account throughput.
You can't even remotely say things like this with any confidence, because if throughput were the key metric organisations wanted, then AMD's server market share would have at least stabilized with Bulldozer's release, and we would have seen Steamroller- and Excavator-based server parts. Those cores themselves would also likely look completely different than they do now.
If that were true, neither Intel nor AMD would even bother selling HT-capable CPUs as datacenter products. Total throughput is still going to matter. You are also failing to take into account the fact that Zen will have to have significantly higher performance from the first thread of each core vs. the first thread of each XV module in order to exceed XV's throughput given identical clockspeeds. AMD has to deliver on both counts.
You realize that if you have high ILP (i.e., high per-thread throughput) then SMT provides very little in the way of performance benefit, because you're already bottlenecked elsewhere in the core.
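To put rough numbers on that (a toy model in C; the issue width and per-thread IPC figures are made-up illustrations, not measurements of any real core):

    /* Toy model: SMT gain on a core with a fixed issue width.
     * issue_width and the per-thread IPC values are illustrative
     * assumptions, not measurements. */
    #include <stdio.h>

    int main(void) {
        const double issue_width = 4.0;        /* uops the core can issue per cycle */
        const double ipc_alone[] = {3.8, 1.2}; /* high-ILP vs. low-ILP thread */

        for (int i = 0; i < 2; i++) {
            double one = ipc_alone[i];
            /* two SMT threads still share the same issue width */
            double two = (2.0 * one < issue_width) ? 2.0 * one : issue_width;
            printf("alone %.1f IPC -> SMT pair %.1f (gain %.0f%%)\n",
                   one, two, (two / one - 1.0) * 100.0);
        }
        return 0;
    }

The high-ILP thread (3.8 IPC on a 4-wide core) gains about 5% from SMT; the low-ILP thread doubles. That's the whole argument in two lines of arithmetic.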
I don't know how many datacentres you have been in, but I've designed converged infrastructure for more than I care to count. What does the average datacentre compute layer look like? It looks like one of two things:
1. racks of 1RU pizza boxes (normally 2 processors)
2. racks of ~10RU blade enclosures (normally 2 processors per blade)
What in almost every case is running on the bare metal? KVM/ESX/Hyper-V. These days the only x86 stuff you see that isn't running on a hypervisor is big DBs.
Now what does HT allow you to do? It allows you to increase your density of VMs. This is because the hypervisor has a NUMA-aware scheduler, so if you have a VM that is running a throughput workload and maxing a core (or several), it will move workloads around accordingly to deliver the best realtime performance it can to all VMs. Two VMs that aren't doing much end up sharing a core at that point.
Now, you can always just oversubscribe the number of virtual threads to real threads (you almost always do, with or without HT), but when that ratio gets too high (the threshold varies with workload) response times become really poor. Having more hardware threads lets you keep that ratio lower, which gives better performance (particularly in latency).
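To make the ratio concrete, here's a back-of-envelope sketch; the socket/core counts, the VM fleet size and the 4:1 target are all made-up illustrative numbers, not sizing guidance:

    /* Back-of-envelope: vCPU:pCPU oversubscription on a 2-socket box.
     * All counts and the 4:1 target ratio are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        const int sockets = 2, cores_per_socket = 16;
        const int vcpus_provisioned = 224;   /* hypothetical VM fleet */
        const double target_ratio = 4.0;     /* workload-dependent! */

        for (int smt = 1; smt <= 2; smt++) {
            int hw_threads = sockets * cores_per_socket * smt;
            double ratio = (double)vcpus_provisioned / hw_threads;
            printf("SMT %s: %2d threads, %.1f:1 -> %s target\n",
                   smt == 2 ? "on " : "off", hw_threads, ratio,
                   ratio <= target_ratio ? "within" : "over");
        }
        return 0;
    }

Same sockets, same cores: turning SMT on drops the ratio from 7:1 to 3.5:1, and that headroom is exactly the latency win I'm talking about.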
Each Zen core has twice the fp resources of a single XV module.
No it doesn't; it all depends on how you count it. An XV module has 2x FMA units; Zen has 2x FMA per core. An XV module has one FPU store port; we don't know how many Zen has, but likely also one. An XV module has two adds; Zen has two adds. XV has two muls; Zen has two muls.
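Counting it out (this assumes 2x128-bit FMA units on both sides; Zen's datapath width is unconfirmed, so treat that as an assumption):

    /* Peak FP throughput per cycle: XV module vs. Zen core.
     * Assumes 2x128-bit FMA units in each; Zen's width is an assumption. */
    #include <stdio.h>

    int main(void) {
        const int fma_units    = 2;        /* per XV module and (assumed) per Zen core */
        const int sp_lanes     = 128 / 32; /* a 128-bit unit holds 4 SP lanes */
        const int flops_per_op = 2;        /* FMA = multiply + add */

        int peak = fma_units * sp_lanes * flops_per_op;
        printf("XV module: %d SP FLOPs/cycle\n", peak);
        printf("Zen core : %d SP FLOPs/cycle\n", peak);
        /* Same number either way; "twice the FP resources" only appears
         * if you compare a Zen core against one XV core, i.e. half a module. */
        return 0;
    }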
What has changed is the way the hardware is arranged and its exposure to the FPU scheduler. That change in arrangement will help non-FMA SSE and AVX, but not that much for throughput, and the reason is that Zen's load/store system is 2 loads/1 store per cycle. That's half of an XV module, and in throughput workloads you are going to have lots of loads and stores to the memory subsystem. That's why there is such a big AVX256 jump from Ivy Bridge to Haswell: the load/store bandwidth doubled.
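Here's that bound as a sketch. The port counts are the ones I'm claiming above, so treat them as assumptions until real silicon shows up; the kernel shape (2 loads + 1 store per FMA, a triad-style a[i] = b[i]*s + a[i]) is just a common worst case:

    /* Why load/store width caps FP throughput: a triad-style loop
     * (a[i] = b[i]*s + a[i]) needs 2 loads + 1 store per FMA. */
    #include <stdio.h>

    int main(void) {
        const double fma_ports   = 2.0;  /* FMAs issuable per cycle */
        const double load_ports  = 2.0;  /* Zen, per the claim above */
        const double store_ports = 1.0;

        const double loads_per_fma  = 2.0;
        const double stores_per_fma = 1.0;

        /* sustained rate is whichever resource runs out first */
        double bound = fma_ports;
        if (load_ports  / loads_per_fma  < bound) bound = load_ports  / loads_per_fma;
        if (store_ports / stores_per_fma < bound) bound = store_ports / stores_per_fma;

        printf("sustained: %.1f FMAs/cycle of %.1f peak\n", bound, fma_ports);
        return 0;
    }

1.0 of 2.0: the FP units sit half-fed on that kernel, which is exactly why doubling load/store width mattered so much for AVX256 at Haswell.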
What the Zen FPU design does is reduce latency vs. XV. Lots of SSE/AVX workloads have chains of dependent instructions (think physics, games, etc.). Zen will cut the cycle count of those operations vs. XV and should, on average, be able to schedule them earlier (more exposed ports); that's an IPC increase.
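You can see the dependent-chain effect on any x86 box with a minimal C sketch (compile with -O2; absolute times will vary by chip, this only demonstrates the shape of the problem):

    /* Latency vs. throughput: a dependent FMA chain pays the full
     * instruction latency every iteration; independent chains overlap. */
    #include <stdio.h>
    #include <time.h>

    #define N 100000000L

    int main(void) {
        clock_t t0 = clock();
        double acc = 1.0;
        for (long i = 0; i < N; i++)
            acc = acc * 1.0000001 + 1e-9;   /* each op waits on the last */
        clock_t t1 = clock();

        double a0 = 1.0, a1 = 1.0, a2 = 1.0, a3 = 1.0;
        for (long i = 0; i < N; i += 4) {   /* four independent chains */
            a0 = a0 * 1.0000001 + 1e-9;
            a1 = a1 * 1.0000001 + 1e-9;
            a2 = a2 * 1.0000001 + 1e-9;
            a3 = a3 * 1.0000001 + 1e-9;
        }
        clock_t t2 = clock();

        printf("dependent:   %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("independent: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        printf("(%g %g)\n", acc, a0 + a1 + a2 + a3); /* keep results live */
        return 0;
    }

The first loop's time scales directly with FMA latency, so shaving cycles there (Zen vs. XV) shows up immediately; the second is bound by unit count, not latency.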
Now, as to integer IPC: there are lots of things in Zen to help integer IPC, and these will also help FP code with dependent instructions. The big one is the cache system. Bulldozer's L2 is a mess, and that's CMT's fault; high L2 latency doesn't hurt throughput, but it hurts serial code a lot. Next, the integer execution units are actually doubled from a CON core, which will allow things to be scheduled sooner and give a better distribution of complex instructions over the ports (i.e., not having the only branch unit and the only imul on the same port like in CON cores).
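Why does L2 latency hit serial code so hard? Because serial code is full of dependent loads, and a dependent load chain pays the full load-to-use latency on every hop. A minimal pointer-chase sketch (the 64 KB working set assumes a typical 32 KB L1, so the chase lives mostly in L2):

    /* Pointer chase: every load's address depends on the previous load,
     * so each L1 miss is paid in full -- no overlap, no prefetching. */
    #include <stdio.h>
    #include <stdlib.h>

    #define ELEMS (64 * 1024 / sizeof(size_t))  /* ~64 KB working set */

    int main(void) {
        size_t *next = malloc(ELEMS * sizeof *next);
        if (!next) return 1;

        /* Sattolo's algorithm: one random cycle through the array,
         * so stride prefetchers get nothing to lock on to */
        for (size_t i = 0; i < ELEMS; i++) next[i] = i;
        srand(1);
        for (size_t i = ELEMS - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        size_t p = 0;
        for (long i = 0; i < 100000000L; i++)
            p = next[p];                 /* the dependent chain */

        printf("%zu\n", p);              /* keep the chain live */
        free(next);
        return 0;
    }

Time it and divide by the iteration count and you get roughly the L2 load-to-use latency. On Bulldozer that number is ugly, and serial workloads eat it on every chain of dependent loads.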
After that it's the incremental things: better prefetch and prediction, better memory disambiguation. There are other things that might help, but we don't really know yet, like the stack cache implementation (if it's explicit and has lower latency) and the "uop/trace cache" that, according to the patent, exists in the execution/retirement stage of the pipeline, not in instruction decode like in Core 2 CPUs (there also appears to be a loop buffer in decode).