I would, however, be interested in seeing conclusive proof that "there was no 'potential performance' that could feasibly be unlocked by adopting better optimisations" with the CMP line.
That's actually quite straightforward, and we can do it with a direct comparison of K10 with the general Bulldozer-family blueprint.
Recall that K10 was the last in a long line of AMD's triple-pipeline CPUs, starting with the original Athlon. This made it a very mature and well-understood optimisation target. It was also exceptionally tolerant of badly optimised code, or code optimised for a completely different CPU such as the Pentium 4 family. Each of its three integer pipelines could handle the great majority of integer operations, including LEA and load-store ops (as they had an embedded AGU as well as an ALU). Memory and cache latencies were also respectably low, these having been a headline feature of K8. Thus, on any sort of integer code that wasn't memory-limited, it could get remarkably close to its theoretical IPC of 3.
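As a minimal illustration (my example, not from the original argument), this is the sort of plain integer loop those three symmetric lanes chew through at close to 3 IPC - the load, add and store per iteration, plus the loop overhead, can be spread across any of the three pipes:

```c
/* Illustrative sketch: each iteration needs a load, an add, a store,
 * an index update and a branch. On K10 any of the three ALU+AGU lanes
 * can take the memory ops as well as the arithmetic, so nothing
 * serialises on a single specialised pipe. */
void offset_array(int *dst, const int *src, int k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + k;
}
```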
K10 did have two serious weaknesses. The first, and the more severe, was the FPU, which was very outdated and could execute only one (possibly SIMD) add plus one multiply per cycle. The second was the retirement unit: the three-wide design extended to it as well, so it could clear the pipelines no faster than the front-end could stuff new instructions into them. This could lead to front-end stalls when instructions of mixed latency were in play.
Now recall that each Bulldozer core (two per module, up to eight per die) has only two ALUs and two AGUs. Assuming the full four-wide front-end is available (i.e. a single-threaded workload), the front-end can generate twice as many macro-ops as the back-end can execute (each macro-op can use both an ALU and an AGU). The integer back-end is therefore 33% narrower than K10's and is an obvious potential bottleneck. Or, to put it another way, it needed 50% more clock speed merely to equal K10's integer throughput. If that sounds at all familiar, think of the Pentium 4.
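The arithmetic behind those two figures, as a runnable back-of-envelope check (the inputs are just the pipe counts above):

```c
#include <stdio.h>

int main(void) {
    const double k10_ops = 3.0; /* three ALU+AGU lanes: up to 3 integer macro-ops/clock */
    const double bd_ops  = 2.0; /* two ALUs + two AGUs: up to 2 fused macro-ops/clock   */
    printf("back-end width deficit: %.0f%%\n", (1.0 - bd_ops / k10_ops) * 100.0); /* 33% */
    printf("extra clock for parity: %.0f%%\n", (k10_ops / bd_ops - 1.0) * 100.0); /* 50% */
    return 0;
}
```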
Then we come to the memory performance, which was abysmal and remained so throughout the whole family. The memory controller itself was fine, as evidenced by the perfectly good performance achieved by iGPUs sharing it. The problem was the cache hierarchy, which appeared to have been specified and designed by a committee of chimpanzees on mind-altering drugs. Or, more plausibly, it was the result of sticking religiously to synthesised SRAM cells (as opposed to hand-optimised ones) while still striving for that 50% higher core clock speed. Whatever the cause, it had bad latency, worse throughput and terrible hit rates (especially at L1, the only level actually running at core frequency). A lot of Excavator's IPC improvement over Steamroller comes simply from having a 32KB L1-D instead of 16KB.
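If you want to see that last effect yourself, the standard working-set probe will show it - a sketch below (my methodology, not data from the post): walk a buffer a cache line at a time, time the walk for growing buffer sizes, and watch where latency cliffs. On a 16KB L1-D the cliff arrives past 16KB; Excavator's 32KB L1-D pushes it out accordingly.

```c
#include <stddef.h>
#include <stdint.h>

volatile uint8_t sink;  /* defeats dead-code elimination */

/* Walk `buf` one 64-byte line at a time, wrapping at the working-set
 * size. Time this for size = 8K, 16K, 32K, 64K... and the L1-D
 * capacity shows up as a jump in time per access. */
void touch(const uint8_t *buf, size_t size, size_t iters) {
    size_t idx = 0;
    for (size_t i = 0; i < iters; i++) {
        sink = buf[idx];
        idx = (idx + 64) % size;
    }
}
```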
The one thing that Bulldozer did right was to address K10's two serious weaknesses described above. The retirement unit for each core was wider than the front-end, permanently solving the mixed-latency stalls, and the FPU was substantially upgraded: it could execute two multiply-adds, OR two multiplies, OR two adds per cycle - all in SIMD if required. For legacy code using separate multiplies and adds, however, the throughput was merely equal to K10's at the same clock speed. Then AMD ruined it all by providing only one of these improved FPUs for two cores to share - and floating-point-heavy workloads also tend to be memory-heavy, so they ran face-first into the cache hierarchy.
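As a minimal sketch of what that FMA upgrade means in code terms (function names are mine, purely illustrative):

```c
#include <math.h>

/* Legacy form: a separate multiply and add per element - on K10 these
 * feed its one mul pipe and one add pipe; on Bulldozer's FPU they
 * merely match K10's per-clock throughput. */
void axpy_legacy(double *y, const double *x, double a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* FMA form: fma() fuses both operations into one, which a compiler
 * targeting Bulldozer's FMA4 support can map onto either of the
 * module's two fused-multiply-add pipes - the case where Bulldozer
 * can actually pull ahead per clock. */
void axpy_fma(double *y, const double *x, double a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = fma(a, x[i], y[i]);
}
```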
So in theory, a heavy yet single-threaded FP workload, optimised with FMA instructions and with very modest memory-access requirements, could run faster on Bulldozer than on K10. Anything else would find a bottleneck - *somewhere*. So shall we consider multithreaded workloads, which Bulldozer was allegedly designed for?
Straight away we notice the shared FPU, which eliminates Bulldozer's per-clock FP throughput advantage over K10, along with the shared L1-I cache and decoders. Given a legacy workload not using FMA instructions, this gives Bulldozer *half* of K10's FP throughput per core per clock. By converting to FMA (the "unlocked capabilities"), we can get back up to parity. Big whoop. Any actual advantage here would have to come from higher clock speeds or core count.
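Spelling that out with the same pipe counts as before (my arithmetic, two cores active per module):

```c
#include <stdio.h>

int main(void) {
    const double k10       = 2.0;             /* 1 add + 1 mul, per core          */
    const double bd_legacy = 2.0 / 2.0;       /* 2 mul-or-add pipes, shared       */
    const double bd_fma    = 2.0 * 2.0 / 2.0; /* 2 FMAs x 2 flops each, shared    */
    printf("legacy code: %.1fx K10 per core per clock\n", bd_legacy / k10); /* 0.5x */
    printf("FMA code:    %.1fx K10 per core per clock\n", bd_fma / k10);    /* 1.0x */
    return 0;
}
```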
Multithreading also renders Bulldozer's already-questionable L1-I cache completely inadequate to sustain tolerable hit rates, unless by chance both cores are running the same code from the same location in memory. It also halves the decoders' throughput per core, so it now merely matches that of the integer units mentioned earlier, with nothing to spare for keeping the FPU properly fed. The wider retirement unit goes completely to waste in this scenario. Steamroller and Excavator have separate decoders per core (and are significantly better for it), but must still share fetch bandwidth and the L1-I cache.
As for the clock speed, this reached 5GHz if you are generous enough to include the absurd FX-9590 and count the maximum turbo clock speed. This is *not* 50% higher than the best K10 models, which could run all cores at their maximum speed all the time without violating a much more reasonable TDP - and on an older, larger process node to boot. Piledriver and Excavator did nudge north of 4GHz within reasonable TDPs, but that's not enough to overcome the other handicaps - and AMD had already given up on making Bulldozer "faster than the competition" by the time Excavator arrived.
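A quick sanity check on the ratio (the figures are mine, so treat them as hedged: the FX-9590 peaks at 5GHz on turbo, while the fastest K10, the Phenom II X4 980, sustains 3.7GHz on all cores):

```c
#include <stdio.h>

int main(void) {
    /* Best case for Bulldozer vs best sustained K10 clock. */
    printf("clock ratio: %.2fx\n", 5.0 / 3.7); /* ~1.35x - short of the ~1.5x parity target */
    return 0;
}
```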
The only trick left up Bulldozer's sleeve was core count. But we can directly compare Llano with Trinity - same process node, same core count, one with K10 cores and the other with Piledriver - and find that each Piledriver module takes up significantly more die space than a pair of hastily shrunk K10 cores. Even after reducing the size of the iGPU (luckily it wasn't slower - they changed from VLIW5 to VLIW4 to save space), Trinity's total die is larger than Llano's. We can extrapolate to say that simply shrinking K10 to 32nm should have allowed a Phenom II X8 within Vishera's die size, equalling the core count at much lower R&D cost.
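The area argument in sketch form - the die sizes below are approximate public figures I'm supplying, so treat them as illustrative rather than authoritative:

```c
#include <stdio.h>

int main(void) {
    const double llano   = 228.0; /* mm^2: 4x K10 cores + VLIW5 iGPU, 32nm (approx.) */
    const double trinity = 246.0; /* mm^2: 2x Piledriver modules + smaller VLIW4 iGPU, 32nm (approx.) */
    /* Same node, same nominal core count, *smaller* iGPU - yet a bigger
     * die. The extra area can only be in the modules, so four modules'
     * worth of silicon (Vishera) should hold eight shrunk K10 cores. */
    printf("Trinity is ~%.0f%% larger than Llano\n", (trinity / llano - 1.0) * 100.0);
    return 0;
}
```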
So from every angle, Bulldozer was a complete and utter failure, even after making optimistic assumptions about software evolution.
By contrast, Ryzen achieves excellent performance on *current* productivity software, with remarkably few exceptions. No arguments about "performance potential" are needed in that context. Only in games, which are a rather different type of workload, is there a question mark.