Perf/MHz is pretty similar for reasonable tests.
And when you scale clocks up, it goes badly for the A9X. It's easy to build for high IPC; it's hard to build for high IPC and high clocks at the same time.
Extracting the amount of ILP that Twister does, at the frequency it runs, was no easy task. It benefits from design decisions that sacrifice some frequency relative to something like Skylake. But it also suffers from decisions that emphasize power efficiency, both inside the CPU (many latencies could be improved at the expense of power efficiency) and outside it (for example, LPDDR, which has higher latency).
Twister has exceeded most speculation I've seen about how far Apple could revise the uarch in this generation. They heavily increased clock speed while decreasing many instruction and cache latencies, and improved IPC at the same time. We don't really know how far they can push things, but with an explicit focus on higher performance and the less efficient design targets of the desktop, I expect they could come pretty close to Intel.
A dual-core GT2 is what, 89mm2? And that's with PCIe links too. The cores themselves are what, 8mm2 each? So that's pretty much equal for the two.
The L3 cache is 4MB on the compared products, but the L2 is 256KB. I thought you knew an L3 would be slower due to latency. Memory bandwidth is 29.8GB/s versus 51.2GB/s. The A9, due to its slower memory, got a 4MB L3.
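For what it's worth, those two bandwidth figures fall straight out of bus width times transfer rate. This sketch assumes the 29.8GB/s figure is dual-channel LPDDR3-1866 on a 128-bit bus and the 51.2GB/s figure is a 128-bit LPDDR4-3200 interface; the exact configurations are my assumption, not something stated above.

```python
def peak_bandwidth_gbs(mega_transfers_per_sec, bus_width_bits):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    bandwidth = transfer rate * bytes moved per transfer
    """
    return mega_transfers_per_sec * 1e6 * (bus_width_bits / 8) / 1e9

# Assumed configs (not stated in the thread):
# dual-channel LPDDR3-1866, 128-bit bus -> ~29.8 GB/s
print(peak_bandwidth_gbs(1866, 128))  # ~29.856
# LPDDR4-3200, 128-bit bus -> 51.2 GB/s
print(peak_bandwidth_gbs(3200, 128))  # 51.2
```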
Brute force can lead many places when money is of no concern.
The big extra bandwidth on the A9X, like the L3 cache on the A9, doesn't help CPU performance very much in most programs; you can see the same thing going from dual- to quad-channel memory with Intel. It's there for the GPU, as is much of the rest of the large die. And that die really can't be compared with Intel chips, which need a separate southbridge, lack a lot of the mobile-focused peripherals Apple uses, and carry a GPU that doesn't trade die area for power efficiency to the same extent. You wouldn't call Crystalwell a brute-force CPU improvement tactic, would you?
Apple has gotten very smart with their CPU designs, and it feels like you really want to downplay that.