What odd redefinition of the term efficiency is that? Since battery capacity is limited, the most efficient CPU is the one that manages to do the most work before the battery runs out. This is highly workload-dependent: hardware-accelerated video is different from single-threaded, which is different from multi-threaded, from GPU, and from various balances of mixed workloads. There is no mathematical way to simplify or generalize that calculation.
It is not true in general. It is only true in some designs.
Consider a design that uses very little power outside of the CPU, e.g. 1W. What if the CPU needs to run at 1GHz to use 1W but it can run up to 2GHz on the same voltage? Such a CPU will use 2W at 2GHz. Now you can run your CPU at 1GHz and the design will use 2W total. Or you can run it at 2GHz and it will use 3W total. Which one is more efficient? The 2GHz run finishes the task in half the time, so it uses 1.5 units of energy to the 1GHz run's 2 — racing to idle wins here.
(The above ignores the leakage which does not scale with frequency. I.e. such a CPU at 2GHz will use even less than 2W, depending on how much can be attributed to the leakage.)
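A quick numeric sketch of that example (the task size is an arbitrary assumption of mine; everything else is as described above):

```python
# Toy model of the example above: 1 W outside the CPU, and a CPU that
# draws 1 W at 1 GHz or 2 W at 2 GHz (same voltage, power ~ frequency).
task_cycles = 1e9  # arbitrary amount of work for the task

def total_energy(freq_ghz, cpu_watts, rest_watts=1.0):
    seconds = task_cycles / (freq_ghz * 1e9)   # higher clock -> less time
    return (cpu_watts + rest_watts) * seconds  # joules

slow = total_energy(1.0, 1.0)  # 2 W for 1 s   -> 2.0 J
fast = total_energy(2.0, 2.0)  # 3 W for 0.5 s -> 1.5 J
print(slow, fast)
```

Despite drawing more power, the 2GHz run uses less total energy because the fixed rest-of-system power is paid for half as long.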
Obviously, the frequency/voltage curve often means that you cannot double your frequency at only twice the power consumption. However, you cannot just say "it is mathematically provable that a design that uses half its power on the CPU is the most efficient". Or rather, you can say it once you provide such a proof.
There are far too many variables (frequency/voltage curve, leakage, ...) for there to be one single solution to this. It depends on the design.
This is getting silly. I prefer to lurk, but fine, I'll provide the math and then shut up again.
Consider a system where we divide power consumption into processor and "rest of system". For a task that takes a given amount of time, we can trivially write:
E(total) = (P(cpu) + P(rest)) * t
We want to examine the P(cpu) that minimizes total energy consumption, which in the case of a battery driven device like a phone or a notebook directly translates to increased battery life.
We will look at 3 regimes of CPU power-versus-frequency scaling: linear, quadratic, and cubic. If you go look at actual power profiles, even the best case is usually at least somewhat superlinear, so linear is clearly a best-case scenario. We will also assume that task performance scales perfectly with frequency, which is likewise a best-case scenario.
Let us examine the middle case first, where scaling is quadratic. We can then rewrite P(cpu) in terms of some (arbitrary) power, A, and a frequency scaling factor, x. The time taken for our task must be scaled as well. We will call P(rest) B to make the equations somewhat simpler visually.
E = (A*x² + B)*t/x
where P(cpu) = A*x²
To find the minimum, we differentiate and look at the roots:
dE/dx = (A - B/x²)*t
=>
x² = B/A
x = sqrt(B/A), given our non-negative constraints.
P(cpu) for minimum power consumption:
P(cpu) = A*x² = A*B/A = B
I.e. at quadratic scaling, the most efficient (least energy consuming) case is where the processor matches the rest of the system in power, as stated by Abwx.
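The result is easy to sanity-check numerically. A minimal sketch, with A and B as arbitrary example values of my own (A = 0.5, B = 2):

```python
# Minimize E(x) = (A*x^2 + B) * t/x over a fine grid of frequency scales.
A, B, t = 0.5, 2.0, 1.0

def energy(x):
    return (A * x**2 + B) * t / x

xs = [i / 1000 for i in range(100, 10000)]  # x from 0.1 to 9.999
x_best = min(xs, key=energy)

print(x_best)         # ~ sqrt(B/A) = 2.0
print(A * x_best**2)  # P(cpu) at the optimum ~ B = 2.0
```

The grid minimum lands at x = sqrt(B/A), where CPU power equals rest-of-system power, as derived.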
There are a few additional observations, but we'll deal with linear and cubic scaling first. Feel free to do the math; I'll simply note the results.
For linear scaling, the processor uses the same amount of energy for the task regardless of frequency (it just takes longer if clocked lower). Total system energy is obviously minimized by finishing as fast as possible. In this very idealized regime, race-to-idle is indeed best.
For cubic scaling, plugging in x³ instead of x² above gives a minimum at:
x³ = B/(2*A)
corresponding to the most efficient P(cpu) of:
P(cpu) = B/2
What we are seeing here is clear. For sub-quadratic power scaling, we want the processor to finish as quickly as possible (the optimal CPU power comes out above rest-of-system power); at the quadratic limit we want CPU power equal to the (rest of the) system power; and for superquadratic scaling we want it lower than system power.
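The same differentiation for a general power law P(cpu) = A*x^n (with n > 1) gives (n-1)*A*x^n = B, i.e. an optimal P(cpu) of B/(n-1), which reproduces B at n = 2 and B/2 at n = 3. A numeric sketch checking this (the exponents and the A, B values are my own picks):

```python
B, t = 2.0, 1.0

def optimal_cpu_power(n, A=0.5):
    # Minimize E(x) = (A*x^n + B) * t/x over a fine frequency grid.
    energy = lambda x: (A * x**n + B) * t / x
    xs = [i / 10000 for i in range(1000, 100000)]  # x from 0.1 to ~10
    x_best = min(xs, key=energy)
    return A * x_best**n

for n in (2, 3, 4):
    print(n, optimal_cpu_power(n), B / (n - 1))  # numeric optimum vs B/(n-1)
```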
At this point I need to point out a few assumptions:
1. Performance scaling is very rarely linear and any sub-linear scaling will push the processor towards lower clocks (and thus lower power), even in the linear power scaling regime.
2. The concept of isolated task-energy is *usually* completely wrong when looking at real world usage patterns. Outside of things like compilation or rendering tasks, the system is in use before and after the isolated task. In this case, the equation should be:
E(tot) = (A*x² + B)*t/x + (A0 + B0)*(T-t/x)
This is the relevant equation when looking at how long your phone or notebook will last while browsing, watching a movie, programming, using a word processor, or doing anything else that can be considered a "base load" (even if not idle), and asking what the most efficient processor power is while spiking to do whatever Windows is up to (or any other non-critical task that would slow down what you are doing). A0 and B0 in this case are the base loads of the CPU and the "rest", respectively.
Solving that gives this for most efficient P(cpu):
P(cpu) = B - (A0+B0)
I.e. the processor should consume about as much power as the rest of the system at load MINUS the idle (or base) power of the system.
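A numeric check of the base-load result, again with arbitrary example values of my own (A = 0.5, B = 3, base load A0 + B0 = 1, task work t = 1 inside a window T = 10):

```python
A, B, A0, B0, t, T = 0.5, 3.0, 0.3, 0.7, 1.0, 10.0

def energy(x):
    # Run the task at frequency scale x, then sit at base load
    # (A0 + B0) for the remainder of the fixed window T.
    return (A * x**2 + B) * t / x + (A0 + B0) * (T - t / x)

xs = [i / 10000 for i in range(1000, 50000)]  # x from 0.1 to ~5
x_best = min(xs, key=energy)
print(A * x_best**2, B - (A0 + B0))  # both ~ 2.0
```

The grid minimum confirms that the optimal CPU power is the load-time rest-of-system power minus the base load, as stated.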
The conclusion to all of this is that race-to-idle is very often a very bad choice, at least for non-interactive/blocking tasks.
It is quite easy to plug in real numbers for real systems and the results are, in my experience, quite close to Abwx' postulate of most-efficient being when processor power is close to rest-of-system power.