I wonder why/how two instruction decoders are preferable to a micro-op cache in a design focused on efficiency. They state a uop cache is not deemed necessary without performant AVX? I guess it's actually not an either/or, and the two decoders can be seen as a more flexible split of one bigger decoder; see the later quote: "Predecode info cached to avoid variable length decoding. 2×3 config reduces length decoding power/area when it’s needed"
Micro-op cache makes sense in big core designs where it needs to perform high per clock and clock close to 5GHz. The die and power overhead of the uop cache is small in light of the gigantic uarch.
But note that in E-core designs the tables turn. Uop caches store decoded instructions, so a 4K-entry uop cache actually occupies a multiple of the storage that 4K instructions would take in the instruction cache. There is also a pipeline penalty on a miss.
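To put rough numbers on that storage multiple, here is a back-of-the-envelope sketch. Both byte sizes are illustrative assumptions (average x86 instruction length, bytes per decoded uop entry), not Gracemont or any specific core's real figures:

```python
# Illustrative only: both per-entry sizes are assumptions, not real uarch data.
AVG_X86_INSN_BYTES = 4   # assumed average x86 instruction length
UOP_ENTRY_BYTES = 8      # assumed storage per decoded uop entry

def cache_bytes(entries, bytes_per_entry):
    """Raw data-array size for a cache holding `entries` items."""
    return entries * bytes_per_entry

# 4K instructions held as raw bytes vs. held as decoded uops:
icache_equiv = cache_bytes(4096, AVG_X86_INSN_BYTES)  # 16 KiB of raw instructions
uop_cache = cache_bytes(4096, UOP_ENTRY_BYTES)        # 32 KiB of decoded uops

print(uop_cache // icache_equiv)  # 2 -- decoded storage is a multiple of raw size
```

Under these assumptions the decoded form costs 2x the SRAM; the real ratio depends on the actual uop entry width, but the direction of the trade-off is the point.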
Intel went from 16 stages in Nehalem to 14-19 stages in Sandy Bridge, where the worst case scenario adds 2 stages and best case scenario allows skipping 2. In this case Intel is essentially using the uop cache to increase clock speed using more pipeline stages while minimizing the perf/clock penalty.
Adding pipeline stages adds quite a bit of complexity in reality, in addition to noticeably reducing per-clock performance. And for the uop cache hit rate to be high, the cache has to be large. So by avoiding the uop cache they avoid both the miss penalty and the comparatively large area it needs. It's a multi-faceted design decision that takes all parameters into account.
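The hit-rate dependence can be sketched with a simple expected-depth calculation using the Sandy Bridge best/worst-case stage counts quoted above (the hit rates themselves are made up for illustration):

```python
def expected_depth(hit_rate, hit_depth=14, miss_depth=19):
    """Average effective frontend pipeline depth given a uop-cache hit rate.
    Depths follow the Sandy Bridge best/worst cases quoted above."""
    return hit_rate * hit_depth + (1 - hit_rate) * miss_depth

# With a high hit rate the average approaches the short uop-cache path:
print(round(expected_depth(0.8), 2))  # 15.0
# Below a 60% hit rate the average is worse than Nehalem's fixed 16 stages:
print(round(expected_depth(0.5), 2))  # 16.5
```

The break-even against Nehalem's 16 stages lands at a 60% hit rate in this model, which is why a small uop cache (with a low hit rate) would be worse than no uop cache at all.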
Gracemont should be a 13-stage pipeline, which means that compared to the big cores it has a rather significant advantage: branch mispredictions cost fewer cycles, which in turn means more efficiency.
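A rough way to see that advantage, assuming the misprediction penalty roughly tracks the frontend refill depth (the 5 MPKI misprediction rate here is an arbitrary illustrative number):

```python
def mispredict_cycles_per_1k(mpki, refill_depth):
    """Cycles lost to branch mispredictions per 1000 instructions,
    assuming the penalty roughly equals the pipeline refill depth."""
    return mpki * refill_depth

# Same workload (assumed 5 mispredicts per 1K instructions), different depths:
print(mispredict_cycles_per_1k(5, 13))  # 65 -- shorter 13-stage E-core pipeline
print(mispredict_cycles_per_1k(5, 19))  # 95 -- deeper big-core worst case
```

Same misprediction count, ~30% fewer wasted cycles purely from the shorter pipeline; wasted cycles are wasted energy, hence the efficiency win.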
Intel went for the clustered decode approach because straight up widening the decoder from 3 to even 4 wide increases complexity quadratically, and some argue even exponentially. So going from 3 to 4 wide may mean a 60% increase in decode area and power.
The clustered decode means it can essentially double the issue rate without paying that area penalty. Intel claims close to linear scaling in area/power. And based on that article, Gracemont can reliably hit a 5-wide issue rate, which is quite fantastic.
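Comparing the two scaling models makes the argument concrete. This sketch assumes a simple quadratic cost model for a monolithic decoder (the superlinear case argued above) versus linear replication for clusters; the exact exponent is a modeling assumption, not an Intel figure:

```python
def monolithic_cost(width, base_width=3, base_cost=1.0):
    """Assumed quadratic scaling model for widening one decoder."""
    return base_cost * (width / base_width) ** 2

def clustered_cost(clusters, base_cost=1.0):
    """Near-linear model: replicate a fixed 3-wide cluster."""
    return base_cost * clusters

print(round(monolithic_cost(4), 2))  # 1.78 -- even 3->4 wide costs ~78% here
print(round(monolithic_cost(6), 2))  # 4.0  -- a monolithic 6-wide costs 4x
print(round(clustered_cost(2), 2))   # 2.0  -- two 3-wide clusters cost ~2x
```

Note the quoted 60% for going 3-to-4 wide sits between this model's linear (+33%) and quadratic (+78%) cases, but either way two 3-wide clusters deliver 6-wide peak decode at half the cost the quadratic model assigns to a monolithic 6-wide decoder.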