re: R20, Golden Cove did increase the vector units to 3 from 2. Between that and the DDR5 bandwidth increase you'd think R20 would be way faster. Don't ask about power, especially at 5 GHz.
Sunny Cove already had 3 vector ALUs and was able to pump 3x256 ALU-type instructions per clock. I think most of the Cinebench MT prowess instead comes from:
1) Big cores can do 3x256 loads per cycle and can execute a wide mix of ALU and vector ALU instructions per clock. As the Zen3 investigation found, on Zen3 the IPC of Cinebench R23 is 1.41 and it is 23% bound by backend resources, so more execution capability goes a long way toward enhancing performance. And I suspect Intel is adding a fast FADD unit because it helps in Cinebench-style workloads.
2) Small cores are actually beastly and offer the same 3 vector ALUs, even if one is limited to ALU operations only (think something like VPANDX but not VADDXX or VMULXX). And all that is backed by 2 loads + 2 stores per cycle. The question is how wide those are, but remember Zen3 can only do 2x256 load + 1x256 store per cycle.
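To put the load/store figures quoted above side by side, here is a rough back-of-the-envelope sketch. The big core and Zen3 numbers are the ones from this post; the small core's access width is the open question, so the 128-bit and 256-bit candidates below are purely hypothetical placeholders, not known specs.

```python
def bytes_per_cycle(ops_per_cycle, width_bits):
    """Peak bytes moved per cycle for a given number of accesses of a given width."""
    return ops_per_cycle * width_bits // 8

# Figures taken from the post.
big_core_load = bytes_per_cycle(3, 256)   # 3x256 loads per cycle
zen3_load = bytes_per_cycle(2, 256)       # 2x256 loads per cycle
zen3_store = bytes_per_cycle(1, 256)      # 1x256 store per cycle

# Small core: 2 loads + 2 stores per cycle, width unknown.
# ASSUMPTION: 128 and 256 bits are hypothetical candidates for illustration only.
small_core = {
    width: (bytes_per_cycle(2, width), bytes_per_cycle(2, width))
    for width in (128, 256)
}

print(f"big core load : {big_core_load} B/cycle")
print(f"Zen3 load     : {zen3_load} B/cycle, store: {zen3_store} B/cycle")
for width, (load, store) in small_core.items():
    print(f"small core @ {width}-bit (hypothetical): load {load} B/cycle, store {store} B/cycle")
```

These are peak numbers only; they say nothing about what the caches can actually sustain.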
Intel has returned to doing the sane thing and is turning its back on the FMA crowd (the two of them who are running Linpack and Prime95 all day) that is ruining performance for normal people. Skylake degraded the latency of simple FP add / mul instructions from 3 cycles to 4, and even if throughput is good, latency still matters. Small Atom cores like Tremont in fact had 3-cycle FP add latency when the big core had 4.
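Why latency still matters even with good throughput can be shown with a very simplified model: a batch of FP adds split across some number of independent dependency chains takes at least as long as the longest chain (latency bound) and at least as long as the ports allow (throughput bound). The function name, the 1200-op batch, and the 2 adds per cycle throughput are assumptions for illustration; only the 3 vs 4 cycle latencies come from the post.

```python
def fp_add_cycles(n_ops, latency, adds_per_cycle, chains):
    """Rough cycle estimate: the larger of the latency bound and the throughput bound.

    ASSUMPTION: ops are split evenly over `chains` independent dependency
    chains and nothing else competes for the FP ports.
    """
    latency_bound = (n_ops / chains) * latency
    throughput_bound = n_ops / adds_per_cycle
    return max(latency_bound, throughput_bound)

N_OPS = 1200          # hypothetical batch size
ADDS_PER_CYCLE = 2    # assumed throughput, same for both cases

for chains in (1, 2, 4, 8):
    lat3 = fp_add_cycles(N_OPS, 3, ADDS_PER_CYCLE, chains)
    lat4 = fp_add_cycles(N_OPS, 4, ADDS_PER_CYCLE, chains)
    print(f"{chains} chain(s): 3-cycle latency {lat3:.0f} cycles, 4-cycle latency {lat4:.0f} cycles")
```

In this model the 4-cycle unit is a third slower whenever the code has few independent chains, and the gap only closes once there are enough parallel chains (8 here) to hit the throughput limit. Scalar code with long dependency chains rarely gets there.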
Everything in the floating point world is executed on what Intel calls "vector" units. Even simple, non-vectorized floating point variables (float x; double y) are loaded into 128-bit XMM registers and handled by instructions like ADDSS / ADDSD and MULSS / MULSD.
So looking at the resources the "small" core has, it can in fact match Skylake in throughput and beat it in latency for those small ops, while also having an additional FP/VEC port for ALU operations. So it already starts the game with more execution resources than Skylake and is more similar to Sunny Cove than to Skylake.
And the funny thing is, since we are talking about separate execution ports for FP/VEC, the additional four integer ALU ports are free to do operations, unlike on Skylake/Sunny Cove, where PORT0 / PORT1 are overcrowded with hardware and, once busy with FP/VEC, are not available. For example, Skylake/SNC will have just one shift ALU available for various operations, while Atom has 4 to choose from; and where they have just one integer multiplier unit and zero dividers available, Atom can choose from 2 ports to do these ops.
I think the only real bottleneck with so many ports is gonna be the 5-wide allocation that has to feed them all; if they had 6-wide allocation like Skylake, they would be matching Sunny Cove instead. The next generation of Atom is gonna be exciting, even if the current one is good for marketing Cinebench numbers only.