This is nonsense. The focus on clocks is what kills efficiency on the P cores (aside from horrible execution). Pentium 4 amply demonstrated that pipeline stages need more transistors than originally expected.
By aiming for lower clocks you can have a shorter pipeline, which means a smaller branch mispredict penalty, and caches that respond in fewer cycles. You also need fewer transistors, which means better efficiency.
The memory subsystem is vastly superior due to better engineering and the focus on lower clocks. It's 192KB + 128KB L1 for the A12 and successors with 3 cycle latency. It completely blows the competition away. Its massive, low-latency caches are another reason why it's so power efficient.*
You should actually read proper articles describing fundamental CPU architecture rather than just guessing. And the claim that there is no simple answer is laughable - one company has clearly been running laps around the others for many years now, with a lead so big that even after 4 years of stagnation it's still among the top. The designers clearly knew what they were doing. Do you think these guys just close their eyes and randomly decide by ballot what to improve?
*Keeping data accesses as close to the core as possible is what saves power. It is that simple. Apple is just executing on common sense logic. Engineers have long said SRAM has the lowest power per bit.
These are the high-level basics of the best architecture:
-A pipeline of 8-10 stages, no more. This cuts area and transistors, and improves performance by lowering the branch mispredict penalty.
-Lower clocks, which will increase over time with better process.
-Lower clocks allow making large caches with relatively low latency.
-All the decisions above also serve to lower power consumption. Large SRAM reduces pJ/bit and thus requires less power per unit of compute.
-Pair with excellent management and brilliant engineers.
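To put the pipeline-depth point from the list above in rough numbers, here is a back-of-the-envelope sketch. It assumes the flush penalty of a mispredicted branch is roughly equal to the pipeline depth in cycles, which is a simplification, and every workload number in it is made up for illustration, not measured on any real chip.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once branch mispredict stalls are added."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Hypothetical workload numbers, for illustration only.
BASE_CPI = 0.25         # CPI with perfect branch prediction
BRANCH_FRACTION = 0.20  # one in five instructions is a branch
MISPREDICT_RATE = 0.02  # 2% of branches are mispredicted

# Assumption: flush penalty in cycles is about the pipeline depth.
short_pipe_cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 10)
long_pipe_cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)

# Clock advantage the deeper pipeline needs just to match the shorter one.
break_even_clock_ratio = long_pipe_cpi / short_pipe_cpi

print(f"10-stage CPI: {short_pipe_cpi:.2f}")
print(f"20-stage CPI: {long_pipe_cpi:.2f}")
print(f"Deeper pipeline needs {break_even_clock_ratio:.2f}x the clock to break even")
```

With these made-up inputs the 20-stage design needs about 14% more clock just to draw level, before counting the extra transistors and power the extra stages cost.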
Apple has had this for the longest time, and that's why they are successful. This has nothing to do with being a fanboy or whatever. That's just stupid bias. It's merely recognizing good work where it exists.
Many people try to find a simple explanation of why one chip is faster than the other one.
Some try to explain it by the ISA and make a lot of stupid claims regarding RISC vs. CISC. Others try to explain it using the decoder width: just make the decoder wider, and that's all. From time to time, I have seen speculation about the L1 cache size and latency. A long time ago, there was also speculation regarding the pipeline length, especially with the NetBurst launch...
The hard truth is that there's no silver bullet. When you design an architecture, you need to find the right balance to meet the requirements and restrictions.
Let's take Apple P-cores as an example. Initially, they were designed for mobile devices with lower clock speeds, so Apple had to make them wide. The P-core in Apple M-series chips, for example, was designed to execute 8 uops per cycle on average. This approach requires much more complex structures, takes more area, and is more expensive. It offers better efficiency at lower frequencies but scales pretty badly at higher ones.
As a result, a core of the Apple M4 running at 4.5 GHz consumes more than twice the power of an M1 core (7.21W vs. 3.43W). Actually, we are looking at power numbers similar to those of AMD Zen 4 running at 5.2-5.4 GHz on a much older node.
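A quick sanity check of those power numbers, as a rough sketch. The wattages and the M4 clock are the ones quoted above. The 3.2 GHz M1 P-core clock is my own assumption and is not stated in this post, so treat the per-GHz figures as approximate.

```python
# Per-core power figures and M4 clock as quoted in the post.
M4_WATTS = 7.21
M1_WATTS = 3.43
M4_GHZ = 4.5

# Assumption (not from the post): M1 P-core peak clock of about 3.2 GHz.
M1_GHZ = 3.2

power_ratio = M4_WATTS / M1_WATTS   # how much more power the M4 core draws
clock_ratio = M4_GHZ / M1_GHZ       # how much higher it clocks

m1_watts_per_ghz = M1_WATTS / M1_GHZ
m4_watts_per_ghz = M4_WATTS / M4_GHZ

print(f"Power ratio M4/M1: {power_ratio:.2f}x")
print(f"Clock ratio M4/M1: {clock_ratio:.2f}x")
print(f"M1: {m1_watts_per_ghz:.2f} W/GHz, M4: {m4_watts_per_ghz:.2f} W/GHz")
```

Under that assumption, power grows by roughly 2.1x for about 1.4x the clock, which is the "scales pretty badly" part. Note that this ignores IPC gains and node changes between the two chips.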
Intel and AMD use a different approach. They are already running at high clocks but process ~6 uops per cycle on average in the current generation (Zen 4 and Raptor Cove). The most obvious step is to widen the execution. Skymont and Lion Cove are targeted to execute ~8 uops per cycle on average (nearly the same amount as Apple Silicon) while running at higher clock speeds.
I think we will see something similar with Zen 5. Probably, they increased the average throughput from 6 to 7 uops per cycle to get a 16% IPC boost.
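The arithmetic behind that guess, as a tiny sketch. It assumes IPC scales linearly with the average uops executed per cycle, which is an idealization. The helper name is just for illustration.

```python
def relative_uplift(old_uops_per_cycle, new_uops_per_cycle):
    """Fractional throughput gain when average uops per cycle goes from old to new."""
    return new_uops_per_cycle / old_uops_per_cycle - 1


# 6 -> 7 uops per cycle (the Zen 5 guess above)
zen5_guess = relative_uplift(6, 7)

# 6 -> 8 uops per cycle (the Skymont / Lion Cove target mentioned above)
wide_target = relative_uplift(6, 8)

print(f"6 -> 7 uops/cycle: {zen5_guess:.1%}")
print(f"6 -> 8 uops/cycle: {wide_target:.1%}")
```

Going from 6 to 7 gives a bit under 17%, which lines up with a ~16% IPC claim. Going from 6 to 8 would be about 33% under the same idealized assumption.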
Apple's position looks pretty confusing here because you can't increase the frequency indefinitely. We have seen that many times before with Intel (NetBurst, Skylake, Alder Lake). At a certain point, you have to introduce some major changes to the architecture. M4 is technically another refresh of M1 (M1+++). The case with Apple is even more complicated because they use the same P-cores for their phones, tablets, laptops, and even the Mac Pro. These all have different priorities.