Well, that's what I've been wondering too. It's not a secret that AMD wanted to "hold the line" with IPC while offering more threads and cores.
^^ Unfortunately they didn't get to Phenom II level and the clock speeds didn't reach the mid-4 GHz range at stock. I would definitely call this AMD's Pentium 4. Holding the line on IPC while offering more threads is all fine and dandy for a workstation or server.
No dice. Among the usual means of improving IPC are techniques to reduce and hide latencies, and to improve speculation (branch prediction and data prefetching behavior), in addition to wider execution units and more work being put upon the execution schedulers. Such improvements are as good for servers as they are for desktops, even though the stalls that take up the most time may be substantially different.
While the degree of improvement will vary, enough servers perform enough different workloads that improving performance across the board is pretty much a must.
Limited performance, but high performance per Watt, can be good.
Limited performance, with OK performance per Watt, and high performance per dollar, can be good.
Limited performance, with low performance per Watt, simply is not good.
It may perform a bit better or worse in one arena compared to another, and that's fine. But if performance is barely acceptable for the high-volume mass-market SKUs, any attempt to frame it as some server vs. desktop vs. mobile thing is either a distraction from or a rationalization for lackluster performance. All performance-driven markets need higher performance per thread per Watt than Stars offers.
Also, AMD has neither the resources nor the market share to make the push on the software side to see Bulldozer-favored implementations through, nor a history of pushing for that. Releasing Bulldozer without an optimized scheduler for Windows 7, when they had already seen their chips underperforming, is a prime example.
Intel had the same issue with Hyperthreading, and Windows 7 had to get multiple scheduler updates for Core 2s, which were not new at the time (and not without real need, either, I might add; they were affected by some of the pausing bugs that got fixed!). The scheduler issues are important, certainly, but if the single-thread performance were higher, they could be glossed over as something to look forward to, rather than looked at as code changes needed to fix AMD's new CPU.
AMD chose to share the L1I and L2 between the cores of a module, and to make the L2 big, so which core (int execution unit) gets which thread matters. On the desktop, that's a hindrance 99% of the time (it's the kind of feature decision that makes sense for a server, which is likely to be running near code paths in two threads of the same application, and likely to have those threads working on near sets of data).
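To illustrate why placement matters, here's a minimal sketch of module-aware pinning on Linux with pthreads. It assumes the two int cores of a module show up as adjacent logical CPUs (0,1 = module 0; 2,3 = module 1), which isn't guaranteed; check /sys/devices/system/cpu/cpuN/topology before relying on that numbering.

```c
/* Minimal sketch: pinning two threads onto the same BD module (shared
 * L1I/L2/front end) vs. separate modules. Assumes logical CPUs 0,1 are
 * module 0 and 2,3 are module 1 -- verify against sysfs topology first. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    /* ...the actual workload would go here... */
    return NULL;
}

/* Pin a thread to a single logical CPU. */
static void pin_to_cpu(pthread_t t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);

    /* Same module: the threads share L1I, L2, and the front end --
     * good if they run near code paths on near data, a hindrance otherwise. */
    pin_to_cpu(a, 0);
    pin_to_cpu(b, 1);

    /* Separate modules would be pin_to_cpu(b, 2) instead: each thread
     * then gets a whole front end and L2 to itself. */

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```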
Though you do bring up a good point in that fewer cores with CMT may look entirely different. The downside is that the core designs, too, would look entirely different, and likely be something we're not going to see any time soon.
Not fewer cores, but fewer execution units (the int portion of a core) per set of front ends. That is, what we have with BD is approximately two 4-wide cores sitting behind a single 4-wide front end, which can fully serve either core in any given cycle, but not both at once (neither '4' is actually quite so definite, but that's been the case for 15 years or more).
So, given a very high-ILP loop whose instruction stream is well-formed for BD, that is not terribly dependent on RAM or cache bandwidth, that is not limited by cache misses, and which has good (i.e., easily predictable) D$/DTLB behavior: running one thread should get about the same performance as with no CMT, while running two such threads should leave each with about half that performance. At 8 cores (4 modules), that becomes four front ends feeding eight execution units. Overall performance should scale 1->4 at 400% (one thread per module), but 1->8 will also be stuck at 400%, because the front ends are starving the rest of the CPU.
Well, in reality, such loops will only exist in synthetic benchmarks, and sustainable IPC tends to be pretty low, so scaling past a 1:1 execution-to-front-end ratio (i.e., 5-8 threads on an 8-core CPU) is less than 100% per added thread for some applications, but still much better than leaving the second set of execution units idle.
With a low-ILP loop that is otherwise like the exhaustively-described hypothetical one, scaling would be 800% at 8 threads on 4 modules, because the front ends would never be saturated for even one cycle (a toy model of both cases is sketched below).
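To make the arithmetic concrete, here's a back-of-the-envelope model under the assumptions above (my illustrative numbers, not measurements): each module's throughput is the smaller of the front-end width and the combined demand of its threads, where a thread's demand is the IPC it could sustain alone. It reproduces the 400% ceiling for high-ILP threads and the 800% scaling for low-ILP ones.

```c
/* Toy CMT scaling model: 4 modules, each with one 4-wide front end feeding
 * two int cores; threads spread one per module before doubling up. */
#include <stdio.h>

#define MODULES  4
#define FE_WIDTH 4.0

/* Total throughput for n identical threads with a given standalone IPC. */
static double throughput(int threads, double ipc) {
    double total = 0.0;
    for (int m = 0; m < MODULES; m++) {
        /* Spread threads round-robin across modules. */
        int on_module = threads / MODULES + (m < threads % MODULES ? 1 : 0);
        double demand = on_module * ipc;
        /* The shared front end caps what each module can issue. */
        total += demand < FE_WIDTH ? demand : FE_WIDTH;
    }
    return total;
}

int main(void) {
    /* High-ILP threads (IPC 4.0) hit the 400% ceiling past 4 threads;
     * low-ILP threads (IPC 1.5) keep scaling to 800% at 8 threads. */
    for (int n = 1; n <= 8; n++)
        printf("%d threads: high-ILP %.0f%%, low-ILP %.0f%%\n", n,
               100.0 * throughput(n, 4.0) / throughput(1, 4.0),
               100.0 * throughput(n, 1.5) / throughput(1, 1.5));
    return 0;
}
```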
It is a trade-off, like SMT (used for BD's FP), but one geared towards making the most of a small number of high-ILP threads while also being able to serve many (2x, in BD's case) low-ILP threads with very little in the way of resource-conflict stalls (i.e., the bane of shared SMT in a fast CPU). And, finally, that part seems to work just fine. Where it may be a limiting factor in higher-thread-count cases, there isn't enough data for us, at this time, to easily separate its effects from those of the caches.