They claim it's energy reduction, but I think that's a side effect of the real goal: higher clocks. And x86-64 decoding still takes longer, so they add more decode pipeline stages to break the work into multiple steps, plus a micro-op cache to cut the energy cost of repeatedly decoding the same instructions into micro-ops. Some ARM designs have a micro-op cache, but apparently not the fastest ones.
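Just to make the amortization concrete, here's a toy model (made-up numbers, not any real uarch): treat the uop cache as a small cache keyed by fetch address, so a hot loop pays the full decode cost roughly once instead of on every iteration.

```python
# Toy model, not any real microarchitecture: count how many full x86 decodes
# happen with and without a uop cache when the same loop body runs repeatedly.
# Loop size, iteration count, and "cache never evicts" are all simplifications.

loop_body = list(range(32))          # 32 distinct instruction addresses in a hot loop
iterations = 10_000

# Without a uop cache: every dynamic instruction goes through the decoders.
decodes_without = len(loop_body) * iterations

# With a uop cache: decode once, then hit the cache on later iterations
# (assuming the whole body fits; real caches evict, which this ignores).
uop_cache = set()
decodes_with = 0
for _ in range(iterations):
    for addr in loop_body:
        if addr not in uop_cache:
            decodes_with += 1        # miss: pay the full decode pipeline once
            uop_cache.add(addr)

print(decodes_without, decodes_with)  # 320000 vs 32: decode energy mostly amortized away
```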
Conroe: 14 stages
Nehalem: 16 stages (due to Turbo)
Sandy Bridge: 14-18 stages; 14 on a uop-cache hit, 18 on a miss. The uop cache itself adds an extra 2 stages, and a miss adds 2 more. SNB had even better Turbo.
So they are trying to get 18-stage clocks with 14-stage performance. Remember that the root of the uop cache idea is the Pentium 4's Trace Cache: Willamette and Northwood had only 1 decoder and relied almost entirely on the TC for decoded uops. So performance is not just about averages but about reducing worst-case scenarios, and missing the TC meant 1-wide decode.
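Back-of-the-envelope sketch of why the worst case matters (all hit rates and widths below are illustrative guesses, not measurements): the effective pipeline depth degrades fairly gently with uop-cache hit rate, but a P4-style 1-wide legacy decode path drags the effective decode width down much faster than the simple average suggests.

```python
# Back-of-the-envelope model; every number here is an illustrative assumption.

def expected_pipeline_depth(hit_rate, depth_hit=14, depth_miss=18):
    """SNB-style: ~14 effective stages on a uop-cache hit, ~18 on a miss."""
    return hit_rate * depth_hit + (1 - hit_rate) * depth_miss

def expected_decode_width(hit_rate, width_hit=3, width_miss=1):
    """P4-style: wide delivery from the trace cache, 1-wide legacy decode on a miss."""
    # Throughput combines harmonically: time per instruction is 1/width,
    # weighted by how often each delivery path is used.
    time_per_insn = hit_rate / width_hit + (1 - hit_rate) / width_miss
    return 1 / time_per_insn

for hr in (0.95, 0.80, 0.50):
    print(f"hit rate {hr:.2f}: "
          f"~{expected_pipeline_depth(hr):.1f} effective stages, "
          f"~{expected_decode_width(hr):.2f}-wide effective decode")

# At a 95% hit rate the averages look fine, but even a modest drop pulls the
# effective decode width toward the 1-wide worst case much faster than the
# arithmetic mean would suggest -- which is the worst-case point above.
```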