Yep, that's the one.
It's an issue they should have seen from a mile away, but because it was meant as a server chip first, it was something they were willing to sacrifice on the desktop, where gobs of cache make less sense.
They're stuck in a rut, the way I see it. Unless software makes exponential strides in multi-threading, they won't be getting anywhere fast. (The biggest stride would be a truly seamless and efficient way of threading, which I've only read about in a Scientific American article last year. I can't find the link, but if I do I'll post it here.)
CMT is ... weird. There are 2 ways of looking at it:
1 - A core that's beefed up to a point where it can resemble 2 cores (integer in this case), or
2 - 2 cores that were stripped down such that they share resources and offer performance close to what would be achieved if they had been separate.
One of those tends to look better than the other. But I think regardless of the way you look at it, unless they're able to significantly increase IPC within the modules, AMD will simply clock themselves out of contention on both the server and the desktop. It's all about IPC here, and the best way to raise it (from what I've been reading) would be restructuring the caches, mainly the L1 and L2. But because of the way it's designed, they're intertwined. So changing the size of the L1 data cache does little as far as smoothing those issues out, and astronomical clock speeds only point to the problem remaining.
The 4KB WCC doesn't look like it's enough, but because the L1 is small and write-through, it's all about the L2. And for BD, that's likely the weakest point in the chip... I can't recall how many stages the pipeline has, but quite clearly the L2 is nowhere near fast enough to be leaned on so heavily by the L1. The L2 has to be slow in order to allow high clock speeds, which was the goal of BD in the first place (4.5GHz), but where will the IPC gains come from if the clock speeds are that high in Trinity? I don't see them addressing the speed of the L2 much, if they even decide to address it at all. Unless they have a secret weapon for getting that L2 up to par, any clock speed bumps are just trying to overcome the still-poor IPC.
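To put a rough shape on why a slow L2 hurts so much when the L1 misses into it a lot, here's the standard textbook average-memory-access-time formula as a quick sketch. Every number in the usage note is made up for illustration (these are not measured BD latencies or miss rates), and it only models loads, so the write-through traffic isn't captured at all.

```python
def average_access_time(l1_hit_cycles, l1_miss_rate,
                        l2_hit_cycles, l2_miss_rate,
                        memory_cycles):
    """Textbook AMAT for a two-level cache, in cycles.

    All inputs are hypothetical placeholders, not real Bulldozer figures.
    Loads only: write-through store traffic is not modelled.
    """
    # Cost paid on every L1 miss: the L2 lookup, plus a trip to memory
    # for the fraction of those lookups that also miss in the L2.
    l1_miss_penalty = l2_hit_cycles + l2_miss_rate * memory_cycles
    return l1_hit_cycles + l1_miss_rate * l1_miss_penalty


def l2_share_of_access_time(l1_hit_cycles, l1_miss_rate,
                            l2_hit_cycles, l2_miss_rate,
                            memory_cycles):
    """Fraction of the average access time spent on L2 hit latency."""
    total = average_access_time(l1_hit_cycles, l1_miss_rate,
                                l2_hit_cycles, l2_miss_rate,
                                memory_cycles)
    return (l1_miss_rate * l2_hit_cycles) / total
```

With invented inputs like `average_access_time(4, 0.1, 20, 0.2, 200)` versus the same call with an L2 hit time of 12, you can see how much of the average is set by L2 latency, and that share grows as the L1 miss rate goes up. That's the sense in which a small L1 makes it "all about the L2".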
err, just to add
The BD pipeline is ~25% longer than the Deneb pipeline. I'm not sure how much the CMT approach had to do with lengthening the pipeline, but in its current form BD doesn't clock high enough to make up that gap, and it's nowhere close to 25%. With a longer pipeline they're almost forced to clock higher, because that's the easiest way to do things. You can point to Power7 and say that it can work, and yes it can, but I don't see any Power7 desktops.
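For anyone who wants the arithmetic behind "clocking higher to make up the gap", here's a back-of-the-envelope sketch. It assumes the simple first-order model that single-thread performance is IPC times clock speed, and the deficit you plug in is hypothetical; I'm not claiming a 25% longer pipeline translates into any particular IPC loss.

```python
def relative_performance(ipc_ratio, clock_ratio):
    """First-order model: performance scales as IPC * clock.

    Both ratios are relative to a baseline chip (1.0 means equal).
    """
    return ipc_ratio * clock_ratio


def required_clock_gain(ipc_deficit):
    """Fractional clock increase needed to cancel a fractional IPC deficit.

    ipc_deficit is a hypothetical input, e.g. 0.20 for 20% lower IPC.
    """
    if not 0.0 <= ipc_deficit < 1.0:
        raise ValueError("ipc_deficit must be in [0, 1)")
    # Solve (1 - deficit) * (1 + gain) = 1 for gain.
    return 1.0 / (1.0 - ipc_deficit) - 1.0
```

The thing to notice is that the clock gain you need is always bigger than the IPC you lost: under this model a 20% IPC deficit takes a 25% clock bump just to break even, and a 25% deficit takes about 33%.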