Intel has gone back to doing the sane thing and is turning its back on the FMA crowd (the two of them who run Linpack and prime95 all day) that has been ruining performance for normal people. Skylake raised the latency of simple FP add / mul instructions from 3 to 4 cycles, and even if throughput is good, latency still matters. A small Atom core like Tremont in fact had 3-cycle FP add latency while the big core had 4.
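To show why latency still matters when throughput is fine, here is a minimal C sketch. The function names are mine and purely illustrative. The first loop is one long dependency chain, so every add waits for the previous result and the loop runs at roughly one add per FP-add latency. The second splits the work into four independent chains that the core can overlap. This assumes the compiler keeps the adds scalar and in source order, which is the usual behaviour for float without -ffast-math.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous result,
   so the loop is bound by FP add latency, not throughput. */
float sum_one_chain(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four independent dependency chains that
   the core can overlap, hiding part of the add latency. */
float sum_four_chains(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both return the same value for inputs that add exactly; in general the rounding can differ slightly because the order of additions changes. A lot of ordinary code looks like the first version, and that is where a 3 vs 4 cycle add would show up.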
Everything in the floating point world is executed on what Intel calls "vector" units. Even simple, non-vectorized floating point variables (float x; double y) are loaded into 128-bit XMM registers, and instructions like ADDSS / ADDSD and MULSS / MULSD are executed on them.
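As a concrete illustration, plain scalar C like the sketch below is what ends up as ADDSS / MULSD on x86-64. The function names are made up for the example. The instruction named in each comment is what mainstream compilers typically emit for a default x86-64 target; with AVX enabled you would see the VEX forms (VADDSS / VMULSD) instead.

```c
/* Scalar single-precision add: typically one ADDSS on XMM registers. */
float add_f32(float a, float b)
{
    return a + b;
}

/* Scalar double-precision multiply: typically one MULSD on XMM registers. */
double mul_f64(double a, double b)
{
    return a * b;
}
```

Compiling with something like `gcc -O2 -S` and reading the assembly shows the XMM registers in use even though nothing here is vectorized.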
So looking at the resources the "small" core has, it can in fact match Skylake in throughput and beat it in latency for those small ops, while also having an additional FP/VEC port for ALU operations. It already starts the game with more execution resources than Skylake and is more similar to Sunny Cove than to Skylake.
And the funny thing is, since we are talking about separate execution ports for FP/VEC, the four additional integer ALU ports stay free to do work, unlike on Skylake/Sunny Cove, where PORT0 / PORT1 are overcrowded with hardware and, once busy with FP/VEC, are not available for anything else. For example, Skylake/SNC will then have just one shift ALU available for various operations, while Atom has 4 to choose from; and where the big cores are left with just one integer multiplier and zero dividers, Atom can choose from 2 ports to do these ops.
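The kind of loop where this would matter looks roughly like the sketch below: a scalar FP add stream running alongside shifts and an integer multiply. The kernel and its names are hypothetical, and the C only shows the shape of the workload. Which port each operation actually lands on is decided by the CPU, and the comments just restate the port split described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixed workload: one FP add plus two shifts and an
   integer multiply per iteration. The two streams are independent
   of each other, so they only compete if they share issue ports. */
void mixed_kernel(const float *x, const uint32_t *k, size_t n,
                  float *fsum_out, uint32_t *hash_out)
{
    float fsum = 0.0f;
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++) {
        fsum += x[i];                        /* FP add: FP/VEC side */
        h = ((h << 5) ^ (h >> 27) ^ k[i])    /* shifts: integer ALUs */
            * 2654435761u;                   /* integer multiply */
    }
    *fsum_out = fsum;
    *hash_out = h;
}
```

On a core with separate FP/VEC ports the FP adds should not take issue slots away from the shifts and the multiply; on a shared-port design they can.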
I think the only real bottleneck with so many ports is gonna be the 5-wide allocation that has to feed them; if they had 6-wide allocation like Skylake, they would be matching Sunny Cove instead. The next generation of Atom is gonna be exciting, even if the current one is good only for marketing Cinebench numbers.