Going from a 1x9-throughput uop-cache to a 2x6-throughput uop-cache is also an SMT-minded decision, because viewed purely from an ST angle it would be a downgrade: a single thread drops from 9 uops/cycle to 6 unless it can draw from both pipes.
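To put rough numbers on that, here's a back-of-envelope model. The assumptions (each pipe serving one thread in SMT, and ST getting both pipes only if the fetch logic can combine them) are mine, not from any disclosed design:

```python
# Back-of-envelope per-thread uop-cache fetch bandwidth (uops/cycle).
# Assumption: in SMT each pipe serves one thread; in ST a thread gets one pipe
# unless the front end can combine both pipes for a single thread.

configs = {
    "1x9": {"pipes": 1, "width": 9},
    "2x6": {"pipes": 2, "width": 6},
}

for name, c in configs.items():
    smt_per_thread = (c["pipes"] * c["width"]) / 2   # two threads share the total bandwidth
    st_single_pipe = c["width"]                      # ST, no pipe combining
    st_combined = c["pipes"] * c["width"]            # ST, if both pipes can feed one thread
    print(f"{name}: SMT per-thread={smt_per_thread}, ST (1 pipe)={st_single_pipe}, ST (combined)={st_combined}")
```

So 2x6 beats 1x9 per thread under SMT (6 vs 4.5 uops/cycle) but loses in ST (6 vs 9) unless both pipes can feed one thread (12 vs 9), which is exactly why the 1T-mode behaviour discussed below matters.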
As far as I understand, entries in the uop-cache contain up to N consecutive instructions starting from a given address, but an entry may hold fewer instructions (see the toy sketch after this list) if:
- An instruction crosses a cacheline; or
- There is a branch in the instruction stream; or
- The current entry doesn't have enough space to hold an instruction that is decoded into multiple uops (probably rare, though).
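To make those termination conditions concrete, here's a toy model of how one entry might get filled. The 64-byte line size, the per-entry instruction and uop budgets, and all the names are my own assumptions, not anything from the actual design:

```python
# Toy model of filling one uop-cache entry. Assumptions (mine): 64-byte
# cachelines, an entry holds at most MAX_INSNS instructions and MAX_UOPS uops,
# and an entry ends at a branch.

MAX_INSNS = 6      # "up to N consecutive instructions"
MAX_UOPS = 8       # total uop budget per entry
LINE_SIZE = 64

def build_entry(insns, start_addr):
    """insns: list of (length_in_bytes, uop_count, is_branch) starting at start_addr.
    Returns how many instructions this entry captures."""
    addr, uops, count = start_addr, 0, 0
    line = start_addr // LINE_SIZE
    for length, uop_count, is_branch in insns:
        if count == MAX_INSNS:
            break                      # entry is full
        if (addr + length - 1) // LINE_SIZE != line:
            break                      # instruction crosses the cacheline
        if uops + uop_count > MAX_UOPS:
            break                      # no room for a multi-uop instruction
        count += 1
        uops += uop_count
        addr += length
        if is_branch:
            break                      # entry ends at the branch
    return count

# Example: the third instruction is a branch, so the entry holds 3 instructions.
print(build_entry([(4, 1, False), (3, 1, False), (2, 1, True), (5, 1, False)], 0x100))
```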
So if the new uop-cache can, when operating in 1T mode, retrieve two consecutive entries (or two entries across a branch) per cycle, then it's actually a win in every case. And that should be quite doable, since the BTB already holds the target address of the next branch.
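As a rough sketch of what that fetch chaining could look like, assuming the uop-cache is indexed by start address and the BTB supplies the taken target of the entry-ending branch (every name here is hypothetical):

```python
# Hypothetical 1T-mode fetch: deliver two uop-cache entries per cycle by
# chaining through either the fall-through address or the BTB-predicted target.
from collections import namedtuple

Entry = namedtuple("Entry", ["end_addr", "ends_in_branch"])

def fetch_two_entries(uop_cache, btb, pc):
    """uop_cache: dict start_addr -> Entry; btb: dict start_addr -> predicted target.
    Returns the entries delivered this cycle (up to two)."""
    delivered = []
    for _ in range(2):                    # two entries per cycle in 1T mode
        entry = uop_cache.get(pc)
        if entry is None:
            break                         # miss: fall back to the legacy decoders
        delivered.append(entry)
        if entry.ends_in_branch:
            target = btb.get(pc)          # BTB already has the next target
            if target is None:
                break                     # no prediction, stop the chain here
            pc = target
        else:
            pc = entry.end_addr           # consecutive entry: just fall through
    return delivered

# Example: first entry ends in a branch whose BTB target points at the second.
cache = {0x100: Entry(0x110, True), 0x200: Entry(0x210, False)}
btb = {0x100: 0x200}
print(len(fetch_two_entries(cache, btb, 0x100)))   # 2
```

In this sketch an entry-ending branch costs nothing extra as long as the BTB has a prediction ready; the chain only stops on a cache miss or an unpredicted branch.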
I agree that this probably has greater impact in SMT, though. I hope someone figures out the chicken bits for these features so we can eventually compare them in different workloads.