-The small-core decoders pass most x86 instructions through directly rather than converting them to internal instructions, continuing what the original Bonnell Atom did.
-Sunny Cove increased the L1 data cache, while Gracemont will increase the L1 instruction cache. The larger L1I likely helps feed the doubled decode.
-Starting with Tremont, it uses dual-cluster decode, which according to the chief architect saves space compared to a uop cache.
-Starting with Goldmont, it also has a predecode cache: a massive 64KB on Goldmont Plus.
Increasing L1i is a must to improve decode rate. The same goes for the predecode L2 cache, a must for such an architecture.
A uop cache (as done by Sandy Bridge+/Zen, not P4), while complex, saves energy by not having to decode the same instructions over and over again. At some point those savings, plus not needing the massive supporting structures around the decoders, win out.
I feel like a big part of Core bloat comes from the sizing of various buffers: ROB, int PRF, FP PRF, uop cache, branch prediction unit buffers, multi-level I/D TLBs, TLBs that support various page sizes, and store and load queues.
Atoms used to cut corners in all those structures, making great use of the diminishing returns of their sizing. For example, large pages in the TLB? Only added in Goldmont? 3 decoders? Also a recent addition.
Combine these cuts with the obvious cuts to execution resources, vector size support, cache sizes, ports and path widths => a tiny die area is the end result.
I expect Gracemont cores to be substantially bigger than Tremont, but they'll still be barely over 1 mm². The Tremont cores are around 0.7 mm². A Sunny Cove core is 5-6x the size of Tremont, not 3-4x.
Atom is more of a "dial the transistor budget to hit the performance targets they need" product. It's not like in 2014 it was a secret to them that an iTLB with large-page support would help performance, or that designing such a TLB was beyond their capabilities. It was a conscious decision to forgo better performance for die size savings.
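To put the quoted area figures side by side, here is a quick back-of-the-envelope sketch. It only uses the numbers from the quote (about 0.7 mm² per Tremont core, Sunny Cove at 5-6x that). The 4-core cluster and the function names are my own assumptions for illustration, and the cluster figure counts cores only, with no shared L2 or interconnect.

```python
# Rough figures taken from the quoted post; treat them as estimates.
TREMONT_CORE_MM2 = 0.7        # "like 0.7mm2"
SUNNY_COVE_RATIO = (5, 6)     # "5-6x the size of Tremont"


def sunny_cove_area_range(tremont_mm2=TREMONT_CORE_MM2, ratio=SUNNY_COVE_RATIO):
    """Return the (low, high) Sunny Cove core area implied by the ratio."""
    low, high = ratio
    return tremont_mm2 * low, tremont_mm2 * high


def cluster_core_area(cores=4, core_mm2=TREMONT_CORE_MM2):
    """Core-only area of a hypothetical small-core cluster.

    Assumption: 4 cores per cluster; shared L2 and interconnect are ignored.
    """
    return cores * core_mm2


if __name__ == "__main__":
    low, high = sunny_cove_area_range()
    print(f"Sunny Cove core: roughly {low:.1f} to {high:.1f} mm2")
    print(f"4 Tremont cores (cores only): roughly {cluster_core_area():.1f} mm2")
```

Under those assumptions, four Tremont cores together take less core area than a single Sunny Cove core, which is the whole point of the die size argument.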
What I don't share with you is the optimism about performance targets. Honestly, that hybrid 1+4 CPU was a disaster, and that is already an understatement. And it used those Tremont cores that had already dialed structure sizes up a lot on 10nm. But look at the disastrous performance compared to other mobile CPUs in, for example, Cinebench R15 or R20.
So Intel's 2x MT performance claim for Alder Lake is very likely a pipe dream on 10nm, probably based on a comparison with some hilariously wattage-constrained mobile CPU.
The desktop reality will be the equivalent of ~11 Golden Coves in Cinebench, waaaaaaay behind AMD's 16-core CPUs. And since it will struggle in the poster children of linear scaling, MT loads that barely touch memory, the best way to use it will be to disable those Atom clusters and not have to deal with scheduling problems.
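For what the "~11 Golden Coves" guess implies per small core, here is a minimal sketch. The 8 big + 8 small configuration is an assumption on my part (the post does not state it), as are perfectly linear scaling and the function name. This is just the arithmetic behind the estimate, not a benchmark model.

```python
def small_core_equivalent(total_big_equiv, big_cores, small_cores):
    """Throughput of one small core, expressed as a fraction of one big core.

    Assumes throughput adds up linearly across cores, which is roughly what
    a Cinebench-style embarrassingly parallel load does.
    """
    if small_cores <= 0:
        raise ValueError("small_cores must be positive")
    return (total_big_equiv - big_cores) / small_cores


if __name__ == "__main__":
    # Assumed configuration: 8 Golden Cove + 8 Gracemont cores (not from the post).
    per_small = small_core_equivalent(total_big_equiv=11, big_cores=8, small_cores=8)
    print(f"Each small core is worth about {per_small:.3f} of a big core")
```

With those inputs, each small core would contribute a bit over a third of a big core in that kind of load.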