Zen can do 4 int PLUS 4 fp per cycle. SKL can do 4 int OR 3 vec int or 2 FP or any combination up to 4 uop/cycle. 8 uop/cycle versus 4 uop/cycle for 2 threads. How come that Ryzen SMT gain less than INTEL's?
Because you have to
SUSTAIN that cycle after cycle for it to make a difference, both can only decode 4 x86 ops and most x86 ops are 1 uop for both. So for all these extra ports to matter you have to be able to feed them and neither Zen or Skylake can Feed more ~6 uops from uop-cache or 4 from Decode.
They also both have the ~same amount of L/S and all the other structures i mentioned, when you have FP workloads for example you will see very high percentage of ops being Loads or Stores, with two threads that will bottleneck both Skylake and Zen before port congestion.
Now you need to find me this workload that is both scalar and SIMD heavy concurrently has a minimum ipc of 2 and is the perfect fit for SMT without bottlenecking the L/S system.
The perfect example of why all these theoretical super high cocurrent port usage doesn't matter is actually 256bit AVX SB/IB vs haswell. Both have the same amount of execution width but haswell is significantly faster because those FP heavy workloads because for 256bit ops it has twice the load and store bandwidth,
So answer me how is Zen going to
SUSTAIN 8 128bit reads and 4 128bit writes a cycle when it can only get 2 reads and 1 write a cycle, its very common to see FP workloads with >50% of operations being loads or stores.
most of the stack uops are deleted from the stream by the stack memfile, moreover the decoding is broken into 2 parts: the high level that translated almost all the x86 instructions into one microop, INCLUDING microcoded instructions, that occupy one slot in early stages and uop cache (contrary to INTEL) and are expanded just before dispatching. The uop cache is bigger and not bloated by microcoded instructions as they occupy only one slot. Moreover we have 10 uops/cycle executable for two threads, plus those executed in the stack/memfile stage, that does not consume any ROB/PRF/queue/cache ports resources, as they are resolved earlier. Finally jumps: Zen can do always 2 jumps/cycle. SKL/KBL can do 2 only if the second port is not occupied by an FP or vecint instruction. Zen does not have this problem.
You dont need to tell me how an X86 processor works, you also do not have, 10uops a cycle. you have 6. upto 6 to int and upto 4 to FP.
edit: before you try to claim its additive to 10uops please explain then why Micheal Clake says they have wider retire then dispatch because it helps to clear out the retire queue and get more instructions in flight, is he lying?