So the flop thing was for Bulldozer in general and not the CMT scaling ??
Both. It doesn't matter if AMD achieved their scaling goal (that's not what I was even talking about in the part you highlighted, and remember that an 8150 would have fared worse). When you set the bar that low...
It's pretty obvious that AMD sacrificed far too much performance in the pursuit of cost savings. Their attempt to compete with Intel on performance/dollar blew up in their face. Regardless of whether or not they achieved their measly scaling goal, it severely tarnished their brand.
The end result was a chip that was decidedly less efficient than their previous architecture from a die-area standpoint. AMD could have easily ported Thuban with updated ISA support to the 32nm process and would have come out well ahead of Bulldozer from a perf/mm² standpoint, and I guarantee they'd have come out significantly farther ahead on R&D costs as well. It took them an extra year to actually make some progress.
Piledriver definitely fixed a lot of the issues that Bulldozer had, but it's run into a wall. How do you fix single-threaded performance if you're decode bound? You could drive up clock speed, which they've already done with Richland. You could increase the capability of the execution units, but then scaling suffers significantly when you load the second core. Obviously there are substantial inefficiencies still left in the pipeline that you could work on, but you're still going to run into that decode and I-cache wall.
The obvious answer is to expand the decode capability and increase the size of the instruction cache, which is exactly what Steamroller's doing. The work going into Steamroller predominantly focused on the front end: branch prediction, prefetch, dispatch, 50% larger instruction cache and doubled decode. The last two changes alone probably account for around half or more of the die real estate being spent in Steamroller, and they're solely being implemented to improve second-thread scaling -- they won't help single-threaded performance at all. This is why I've been stating that the focus of Steamroller is to improve second-thread scaling -- that "30% ops per cycle improvement" is all about the second thread.
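To make the second-thread argument concrete, here's a rough back-of-the-envelope model. It is only a sketch: the function names and the numbers (a 4-wide decoder, a core that can sustain 3 ops per cycle) are assumptions picked for illustration, not measured figures, and it ignores the I-cache, branch prediction and everything else in the pipeline. All it shows is why splitting one decoder between two threads hurts scaling, and why giving each core its own decoder leaves single-threaded throughput unchanged.

```python
def module_throughput(decode_width, core_demand, active_threads, shared_decoder):
    """Toy estimate of sustained ops/cycle for one two-core module.

    decode_width   -- ops/cycle one decoder can deliver (hypothetical value)
    core_demand    -- ops/cycle one core could consume if fed (hypothetical value)
    active_threads -- 1 or 2 threads running on the module
    shared_decoder -- True: one decoder split between the active threads
                      False: each core has its own decoder
    """
    if active_threads not in (1, 2):
        raise ValueError("a module runs one or two threads")
    if shared_decoder:
        # A shared decoder's bandwidth is divided among the active threads.
        per_thread_supply = decode_width / active_threads
    else:
        # A dedicated decoder feeds its own core at full width.
        per_thread_supply = decode_width
    # Each thread is limited by whichever is smaller: what it is fed
    # or what its core can actually execute.
    return active_threads * min(core_demand, per_thread_supply)


def second_thread_scaling(decode_width, core_demand, shared_decoder):
    """Ratio of two-thread throughput to one-thread throughput."""
    one = module_throughput(decode_width, core_demand, 1, shared_decoder)
    two = module_throughput(decode_width, core_demand, 2, shared_decoder)
    return two / one


if __name__ == "__main__":
    WIDTH, DEMAND = 4, 3  # illustrative assumptions, not real chip data
    for shared in (True, False):
        label = "shared decoder" if shared else "decoder per core"
        single = module_throughput(WIDTH, DEMAND, 1, shared)
        scaling = second_thread_scaling(WIDTH, DEMAND, shared)
        print(f"{label}: single-thread {single} ops/cycle, scaling {scaling:.2f}x")
```

With those made-up numbers the single-thread figure is identical in both configurations, and only the two-thread scaling changes. A real module is nowhere near this simple, so treat it as a picture of the argument rather than a prediction.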
That doesn't mean that Steamroller won't move single-threaded performance forward -- it absolutely will. There's so much garbage left to clean up from Bulldozer that it's virtually impossible not to improve things. The front end improvements alone will account for a sizable improvement in single-threaded performance, and there are some pretty significant changes going into the execution units as well. However, Excavator will bring the biggest changes there.
Even if I were optimistic and predicted that Steamroller gives a 30% improvement to single-threaded performance, it'd still be behind Intel by a good margin, and I'm also worried about the unfixed memory latency issue, which will only become a larger bottleneck with Steamroller. Where you will see Steamroller show its tenacity is with multithreading, but this is where we start getting into the land of unknowns. Loading up a second core will take up more power than it did with Piledriver, so in thermally constrained designs like mobile Kaveri, performance will be very much up to how AMD is able to keep power draw down. You'd all better hope that RCM can deliver.