I see a lot of misconceptions in this thread.
1. The thread parent, in the same paragraph, claimed that Zen would be a hit in HPC while simultaneously denying the existence of AVX software. HPC runs 100% AVX workloads, and HPC-specific literature discusses AVX throughput extensively.
Zen is clearly an efficient design, and it could potentially replace Intel or act as a second vendor in a range of server use cases, particularly web and database. However, there are several obstacles preventing it from being a general replacement for Intel's Xeon series. The lack of AVX (especially AVX-512) units makes it entirely irrelevant in HPC. Current rumors for Naples indicate a 4-die MCM package, which would essentially mean three layers of NUMA (2 sockets, 4 dies, 2 CCX per die). Compared to Intel's unified L3 cache on Xeon E5s, this would be a significant challenge for scale-up (big server) workloads.
As a scientist who actually uses a number of HPC programs to run calculations, I take issue with your statement that HPC is "100% AVX workloads." While there certainly are programs out there that support AVX and gain a significant speedup from it, the number of programs that use AVX and benefit from it is smaller than you suggest. I never intended to claim that there were no AVX HPC programs (and if I implied otherwise, that was my mistake), just that the number of real HPC programs that use and take advantage of AVX is not as large as you might believe.
The issue with AVX is that its acceleration is achieved, as its name implies, through new vector instructions. However, not all problems can be easily vectorized (or even vectorized at all). Those with even minimal programming experience should understand that turning scalar code into vector code isn't always a trivial task, and is sometimes outright impossible. There are two issues here: 1) the theoretical issue of vectorization (how do I express my problem in a vectorized/array manner?) and 2) the practical implementation (how do I get the system to recognize and accelerate the vectorized code, whether through intrinsics or a compiler). Again, I'm not a programming expert, but I can grasp and appreciate the difficulty of this problem.
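To make those two issues concrete, here's a small Python/NumPy sketch (my own toy example, not taken from any real HPC code). The first computation is trivial to express in array form, so it maps directly onto wide SIMD units; the second has a loop-carried dependency, where each step needs the previous step's result, so it can't simply be turned into one wide vector operation:

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Easy to vectorize: every element is independent, so the whole
# computation maps onto SIMD lanes (this is the case AVX accelerates).
y_vectorized = 2.0 * x + 1.0

# Hard to vectorize as written: each iteration depends on the value
# computed in the previous one (a loop-carried dependency).
def recurrence(v):
    out = np.empty_like(v)
    out[0] = v[0]
    for i in range(1, len(v)):
        out[i] = v[i] + 0.5 * out[i - 1]  # needs out[i-1] first
    return out

z = recurrence(np.ones(4))
# z == [1.0, 1.5, 1.75, 1.875]
```

Real codes are obviously messier than this, but the recurrence is the shape of the problem: no amount of vector width helps a loop that must run one element at a time.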
Code with many scalar sections is unlikely to see much of a speedup (see the link below). Having heard directly from some programmers writing HPC chemistry ab initio/DFT code, the largest gains they've seen from AVX are around 10%. The issue, quoting them directly, is that scalar sections (as discussed above) don't get a speedup and thus limit the benefit you get from AVX.
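The arithmetic behind that limit is just Amdahl's law. Here's a quick sketch with illustrative numbers of my own choosing (not measurements from those codes): if only a small fraction of the runtime is vectorizable, even a generous 8-wide AVX speedup on that fraction barely moves the total.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)

# 10% vectorizable code with an 8x AVX speedup on that part:
print(amdahl_speedup(0.10, 8.0))  # ~1.096, i.e. roughly a 10% overall gain
# A 90%-vectorizable kernel, by contrast, would see a big win:
print(amdahl_speedup(0.90, 8.0))  # ~4.7x
```

So a reported ~10% gain is exactly what you'd expect from a code that is mostly scalar, no matter how fast the vector units are.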
You don't even have to believe me or the programmers I'm talking to.
This presentation from CERN's physics HPC group brings up many of the exact same issues that I discussed.
As for the literature on AVX in HPC, I think some critical reading is needed. I won't deny that AVX is capable of producing tremendous gains in terms of FP performance. But translating those FLOPS into actual, real-life performance isn't easy, as the article cited by Kalmquist and the document I linked to both show.
IMO the most important thing AMD needs to worry about for HPC is not AVX, but memory bandwidth and the processor interconnect in SMP systems. A lot of programs are limited not just by processing speed but by the system's ability to keep all of those processors fed from memory. The "three layers of NUMA" you alluded to might be more problematic than weak AVX performance.
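One way to see the bandwidth point is the roofline model: attainable performance is capped by the lesser of peak FLOPS and bandwidth times arithmetic intensity. A back-of-the-envelope sketch, with made-up but plausible numbers (these are assumptions for illustration, not specs of any real part):

```python
def attainable_gflops(peak_gflops, bw_gb_s, flops_per_byte):
    """Roofline model: the ceiling is either compute- or memory-bound."""
    return min(peak_gflops, bw_gb_s * flops_per_byte)

# Hypothetical chip: 1000 GFLOP/s peak, 80 GB/s per-socket bandwidth.
# A STREAM-triad-like kernel does ~1 flop per 12 bytes moved:
print(attainable_gflops(1000.0, 80.0, 1 / 12))  # ~6.7 GFLOPS: memory-bound
# Doubling the vector width (peak FLOPS) wouldn't help such a kernel at all:
print(attainable_gflops(2000.0, 80.0, 1 / 12))  # still ~6.7
```

For a bandwidth-bound kernel like this, wider vector units change nothing; only more memory bandwidth (or a friendlier interconnect topology) moves the ceiling.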