Ehh, I'm not sure that is true.
Intel has poured decades of engineering time and research into making their x86 processors as fast as possible. (In fact, I would argue that they may have hit the wall of what they can do with the x86 uArch.) Sure, they had power constraints, but those were much looser than what the typical ARM processor has to live with (120W? Meh, put a bigger fan on it and power it up!)
I think that were ARM to shift their focus to performance rather than power consumption, they could probably reach Intel's level of performance (with a couple of decades behind them). In some ways, I think it is possible for them to end up faster, just because the ARM uArch is slightly less crusty than the good ole' x86.
But I can't say for sure how far it would go (it hasn't been done; they are still targeting sub-watt power consumption in most cases). The only way to know for sure is if ARM decided "Hey, let's make an ARM processor with a 65W power profile" and developed it through several generations.
We have had this discussion here recently. There is nothing inherent in the INSTRUCTION SET that ARM uses that makes it power-friendly. It is ENTIRELY in the microarchitecture.
Does the CPU use aggressive branch prediction? What is the latency/throughput of floating-point operations? Do you have speculative execution? Out-of-order execution? What is the size of the caches? What is the set-associativity of the caches?
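To make that concrete, here's a toy microbenchmark (my own sketch, nothing ARM- or Intel-specific) showing how much branch prediction alone matters on the same silicon. The branch in the loop is random and unpredictable on the unsorted data, then becomes trivially predictable after sorting:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    /* Sum only the "large" elements. For random data the branch is
       taken ~50% of the time unpredictably, which defeats the
       branch predictor; for sorted data it is almost free. */
    static long sum_large(const int *data, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            if (data[i] >= 128)
                sum += data[i];
        return sum;
    }

    static int cmp(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    int main(void) {
        int *data = malloc(N * sizeof *data);
        for (int i = 0; i < N; i++)
            data[i] = rand() % 256;

        clock_t t0 = clock();
        long s1 = sum_large(data, N);       /* random: many mispredictions */
        clock_t t1 = clock();

        qsort(data, N, sizeof *data, cmp);  /* now the branch is predictable */

        clock_t t2 = clock();
        long s2 = sum_large(data, N);
        clock_t t3 = clock();

        printf("unsorted: %ld  (%.3f s)\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("sorted:   %ld  (%.3f s)\n", s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
        free(data);
        return 0;
    }

Compile it at -O0 or -O1; at higher optimization levels the compiler may turn the branch into a conditional move and hide the effect entirely, which is itself a nice demonstration of how much the microarchitectural details matter.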
I'm sure that the engineers at ARM could develop an i7-class processor, but it would take them 4-6 years of engineering work. Debugging a high-end out-of-order, SMT, superscalar, speculative-execution multicore chip with three levels of cache is EXTREMELY hard, even if the general design concepts are well known. Perhaps ARM has a team doing that, but I doubt it. Intel has somewhere on the order of 6-10 CPU design teams working semi-independently on separate chips in parallel to support their multiple platforms, tick-tock-style annual releases, and multiple performance tiers.
I would also point out that the newest Atom matches ARM pretty well on a performance-per-watt basis.
I would also argue that Intel has slowed down their "brute-force" style of performance improvement and gradually migrated to more elegant approaches (exploiting implicit parallelism, elegant SIMD, etc.). Intel is simply up against information theory and quantum physics. Die shrinks don't give the improvements they used to, exotic tricks like branch prediction have already hit 98% accuracy and are reaching theoretical limits, and cache designs are insanely complex in order to squeeze out every last drop of throughput. The Atom does some cool exotic stuff like zero-latency load/store operations while at the same time simplifying the pipeline back to the 486/Pentium era in some ways (in-order, no speculative ops), and it still keeps cutting-edge technology (improved SMT, aka HyperThreading) that ARM doesn't match.
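As a rough illustration of the "elegant SIMD" direction (a generic SSE sketch of my own, not anything Atom-specific), here's what processing four floats per instruction instead of one looks like:

    #include <stdio.h>
    #include <xmmintrin.h>  /* SSE intrinsics; x86-only, compile with -msse */

    /* Sum 4 floats per add instruction. Assumes n is a multiple of 4
       and 'a' is 16-byte aligned (required for _mm_load_ps). */
    static float sum_sse(const float *a, int n) {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_load_ps(a + i));
        float out[4];
        _mm_storeu_ps(out, acc);
        return out[0] + out[1] + out[2] + out[3];
    }

    int main(void) {
        /* GCC/Clang alignment attribute so _mm_load_ps is legal */
        static float a[1024] __attribute__((aligned(16)));
        for (int i = 0; i < 1024; i++) a[i] = 1.0f;
        printf("%f\n", sum_sse(a, 1024));  /* prints 1024.000000 */
        return 0;
    }

The elegance argument is that this kind of explicit data parallelism buys throughput without any of the power cost of deeper speculation or wider out-of-order machinery.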
To some extent, each of these features was iterated gradually over time. The original Pentium, one of the first widely used chips to have branch prediction and speculative execution, used a predictor that was ~93% accurate. The P3 was about 95% accurate. However, in the P4 vs. Athlon days, Intel made an active decision to stick with the P5 branch predictor, presumably to save the transistors and spend them on additional ALU execution units and/or deeper pipelines for clock speed. AMD instead chose a more exotic branch prediction unit that was somewhat more accurate (~97%). History shows us that AMD's approach was correct, and Intel has since adopted a similar BP algorithm as well. But in a low-power design, you might want the simpler predictor in order to reduce power usage. The original ARM7 didn't even have branch prediction (though with its short pipeline, the branch-miss penalty was only 3 cycles). This is just one example of the hundred or so similar tradeoffs that are made for mobile chips.
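To see why a few percentage points of accuracy matter, here's the back-of-envelope math. The ~20-cycle misprediction penalty is my assumption (roughly a P4-era pipeline refill; real penalties vary widely by design, down to ~3 cycles on a short pipeline):

    #include <stdio.h>

    int main(void) {
        /* Assumed misprediction penalty in cycles -- illustrative only. */
        const double penalty = 20.0;
        const double accuracy[] = { 0.93, 0.95, 0.97, 0.98 };

        for (int i = 0; i < 4; i++) {
            double stall = (1.0 - accuracy[i]) * penalty;
            printf("%.0f%% accurate -> %.2f stall cycles per branch\n",
                   accuracy[i] * 100.0, stall);
        }
        /* If roughly 1 in 5 instructions is a branch, going from 93% to
           97% saves about (0.07 - 0.03) * 20 / 5 = 0.16 cycles per
           instruction -- a real win on a deep pipeline, but possibly not
           worth the transistors and power on a 3-cycle-penalty design. */
        return 0;
    }
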
How about discrete GPUs? How do they fare in this comparison?
It's a totally different type of processor, but one might argue that GPUs are actually the most advanced chips in your computer today. They certainly can be by die size and transistor count, especially the number of transistors doing actual processing (vs. L3 cache) built into the die.
An 8-10 core Xeon processor tops out around 3-5 billion transistors, a huge fraction of which is cache. nVidia's Kepler GPU is over 7 billion, with very little cache in comparison. However, to be fair, that is the result of parallelizing hundreds (or even thousands) of smaller processing units, so it all really comes down to "what do you mean by advanced?"