I wish I could at least understand how the architecture made sense. Six ALUs per module with two shared would have made sense as a bare minimum. Eight ALUs per module, with four shared ALUs next to each shared FPU, would have made even more sense. But four unshareable ALUs per module is just boneheaded.
And how are you going to schedule all of them while also hitting decent speeds (the easiest way to speed up IO is to make transistors switch more often)? Just streamlining what had been in there since the Athlon, decoupling the branch prediction, and getting rid of the reservation stations should have provided significant speedups, though it might still leave Nehalem and up doing better at stupid-simple high-IPC loops.
You seem to be looking at it as if sharing execution was on the table. It wasn't, at least not for integer work. It's two separate cores, each with 2 ALU and 2 AGU pipes, that then got their front ends merged into one. There may be front-end issues, but the core scaling is good enough that IPC limitations due to sharing threads do not seem to be a significant issue. I'm sure there are lingering CMT-related problems (AMD didn't exactly have a known-good physical implementation to work from), but that part of BD appears to be working as expected. Performance with several threads looks worse on Windows than Linux (indicating that the performance drops with 2/4/6/8 threads are mostly down to suboptimal scheduling, just like with HT), but scaling in general seems to be as good as or better than Stars, even so.
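For what it's worth, "scaling" here can be made concrete with a quick back-of-the-envelope helper (my own sketch; the example numbers below are made up for illustration, not measured from any BD or Stars chip):

```python
# Toy helper, not from any benchmark suite: turn single-thread and
# n-thread wall-clock times into a speedup and a scaling efficiency.

def scaling(t1, tn, n):
    """t1: 1-thread runtime, tn: n-thread runtime, n: thread count."""
    speedup = t1 / tn
    efficiency = speedup / n   # 1.0 would be perfect linear scaling
    return speedup, efficiency

# Hypothetical example: 100 s on 1 thread, 15 s on 8 threads.
s, e = scaling(100.0, 15.0, 8)
print(f"speedup {s:.2f}x, efficiency {e:.0%}")
```

Comparing that efficiency number across OSes (or against a Stars chip at the same thread count) is all the "scaling looks fine" claim amounts to.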
A single thread not sharing anything just happens to be slower, more often than not. With the slow caches, I would also wonder about the speed of threaded-style server code (DB engines, interpreted languages, and the like). While the file compression magic might translate to those workloads, I'll remain skeptical, for now.
It is just absurd that 2 ALUs in a module will sit totally idle when there is a cache miss. Those ALUs could have and should have been put to work on the module's other thread. If they couldn't figure out a way to do that without spending too many transistors, then the design is worthless.
Having more execution units available is just as valid a decision as keeping a smaller number better utilized, and, all else being equal, each will shine in different workloads. The PhII X6 was highly competitive until SB, and still makes sense for some users today v. a slower i5 (v. a 2500K, if you have the budget, is another story, of course); the same goes for Core 2 and Nehalem Xeons v. Magny-Cours.
As a general rule, if the terms real-time or Quality of Service (QoS) come up, more unshared execution units will be the way to go (sharing = contention = unpredictability; and yes, BD's shared L1 and L2 caches are a glaring exception to this line of thought). In general, though, the need for ever-increasing operations per unit time will be better served by sharing execution units between threads, especially if some of those threads share significant amounts of data (competition = efficiency = high throughput). Going with more execution units (the BD "core") makes sense, given that AMD's main competitor prefers more threads than execution units, and that AMD's cache performance has constantly been a problem.
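The tradeoff is easy to see in a toy throughput model (my own sketch with made-up numbers, not a real Bulldozer simulation): two threads each retire a fixed number of simple integer ops, stalling periodically to mimic a cache miss, run once with 2 dedicated ALUs per thread and once with a 4-ALU pool a thread can borrow while its sibling is stalled.

```python
# Toy model: each thread stalls for `miss_cycles` every `miss_every`
# ops. Dedicated = 2 ALUs per thread; shared = 4-ALU pool split among
# runnable threads (a lone runnable thread gets all 4 slots).

def run(shared, ops=1000, miss_every=50, miss_cycles=10):
    remaining = [ops, ops]
    stall = [0, 0]
    done = [0, 0]
    phase = [0, 25]          # offset miss points so stalls don't align
    cycles = 0
    while any(r > 0 for r in remaining):
        cycles += 1
        ready = []
        for t in (0, 1):
            if stall[t] > 0:
                stall[t] -= 1            # thread is waiting on memory
            elif remaining[t] > 0:
                ready.append(t)
        for t in ready:
            width = 4 // len(ready) if shared else 2
            n = min(width, remaining[t])
            remaining[t] -= n
            done[t] += n
            if (done[t] + phase[t]) % miss_every < n:  # crossed a miss point
                stall[t] = miss_cycles
    return cycles

dedicated = run(shared=False)
pooled = run(shared=True)
print(dedicated, pooled)  # pooling should finish in fewer total cycles here
```

The shared pool wins on total cycles because one thread soaks up the slots its stalled sibling would have wasted, but any single op's completion time now depends on what the other thread is doing, which is exactly the unpredictability that real-time/QoS workloads can't tolerate.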
The real problem is that each BD core has come up just on the wrong side of "meh," rather than providing performance at least significantly superior to their previous CPUs across the board, never mind hanging with Nehalem. Their competition is as much Stars as it is SB. It seems to scale, but that's only great news if you aren't paying the power bill (overclocked gaming box in a college dorm?), or if you expect them to rapidly improve performance per watt by 20% or more.
After that, they could use a GCN core to replace the FPU altogether. But after seeing BD, I have so little confidence that they will do anything remotely intelligent.
Not for some time. The overhead is just too much, and I suspect that as they merge, we're more likely to see vector units shared between them than to see the GPU core replace the CPU core's FPU (the primary advantage each has over the other is in how it treats data access, not the potential FLOPS each can do).