Throughput (including instruction throughput) is not bound by latency. Many people, including Agner Fog, Andreas Stiller and me, think the decoder's throughput is the main bottleneck.
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2
For a single instruction thread, Bulldozer offers more front end bandwidth than its predecessor. The front end is wider and just as capable so this makes sense. But note what happens when we scale up core count.
Since fetch and decode hardware is shared per module, and AMD counts each module as two cores, given an equivalent number of cores the old Phenom II actually offers a higher peak instruction fetch/decode rate than the FX. The theory is obviously that the situations where you're fetch/decode bound are infrequent enough to justify the sharing of hardware. AMD is correct for the most part. Many instructions can take multiple cycles to decode, and by switching between threads each cycle the pipelined front end hardware can be more efficiently utilized. It's only in unusually bursty situations where the front end can become a limit.
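To put rough numbers on the sharing argument in the quote, here is a minimal sketch of the peak decode arithmetic. The 3-wide per-core decoder for Phenom II and the assumption that threads are packed onto as few modules as possible are my own additions, not taken from the quoted text, and the function names are made up for illustration.

```python
import math

# Assumptions (labeled, not from the quoted article):
# - each Bulldozer module has one shared 4-wide decoder serving two cores
# - each Phenom II core has its own 3-wide decoder
# - threads are packed onto as few Bulldozer modules as possible
BULLDOZER_DECODE_PER_MODULE = 4
BULLDOZER_CORES_PER_MODULE = 2
PHENOM_II_DECODE_PER_CORE = 3


def bulldozer_peak_decode(cores: int) -> int:
    """Peak instructions decoded per cycle across `cores` Bulldozer cores."""
    modules = math.ceil(cores / BULLDOZER_CORES_PER_MODULE)
    return modules * BULLDOZER_DECODE_PER_MODULE


def phenom_ii_peak_decode(cores: int) -> int:
    """Peak instructions decoded per cycle across `cores` Phenom II cores."""
    return cores * PHENOM_II_DECODE_PER_CORE


if __name__ == "__main__":
    # Print core count, Bulldozer peak, Phenom II peak side by side.
    for cores in (1, 2, 4, 8):
        print(cores, bulldozer_peak_decode(cores), phenom_ii_peak_decode(cores))
```

Under those assumptions Bulldozer only leads with a single core active; from two cores up, the per-core decoders of the older design add up faster than the shared per-module ones.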
A peak decode rate of 4-4-8-16 for single-, dual-, quad- and 6/8-core configurations respectively is probably limiting it, but widening the front end would only increase performance in more heavily threaded scenarios. Allowing more instructions at each step beyond single-core means a decreased CMT tax and better multi-threaded performance. The benchmarks where BD was supposed to be heavily favored, and either didn't win or didn't win by large enough margins, would see the biggest bumps. A wider front end means higher costs, though. But there's another way of looking at this...
The multi-threaded performance, despite the occasional pileup at the decoder, isn't the problem here. The architecture was planned for clock speeds in the mid-4 GHz range, so the longer pipeline and slower caches (as the poster above noted) make sense given the tradeoffs. They didn't hit those clock speed goals and it all went downhill. Given a higher clock speed, and maybe enhanced branch prediction, the crowded front end wouldn't seem as big a problem as we'd assume. Branch prediction is much more important now than it was with previous AMD architectures, and historically AMD has always lagged behind in it, so they've got a lot riding on this.
Given the clock speeds of the leaked Trinity desktop parts, I'd guess they managed to ramp up the clock speeds with the help of the RCM (resonant clock mesh) licensing. I'd hope for both a wider front end and a bump in clock speeds, but given their direction I doubt we'll get the former; they'll mostly focus on cost-cutting and the latter.
Cyclos and AMD didn't go into too much detail about Piledriver, though they did say it will consist of a 4GHz+ x86-64 core built on a 32nm CMOS process.
http://hothardware.com/News/AMD-Wil...nd-4GHz-Using-Resonant-Clock-Mesh-Technology/
I'd imagine that with Turbo we're looking at the clock speeds that were intended for Bulldozer. Just how much they managed to increase IPC, though, will determine whether it's a big increase in performance (two generations' worth) or what Bulldozer should have been, only a year late (a single generation).