RE: AVX, that actually makes sense: the uop, retire, and store queues are statically shared, but a single workload rarely stresses them as much as SIMD code can. AVX in particular could end up issuing twice as many uops as usual for the same task, which would expose the penalty very clearly.
Why would using AVX issue more uops than normal operation? At least on Intel, most AVX instructions decompose into one or maybe two uops while performing 4 times the work (for 4x32 vectors). For equal throughput, pressure on the uop queue and retire queue would be reduced a lot. The bottleneck would then be the load and store queues.
This could then lead to situations where both threads are stalled waiting for memory, whereas a single thread with a larger uop queue would not be. For this to become significant, a hot path would have to trigger the issue regularly, which should be noticeable as lower power consumption (and corresponding values in the perf counters).
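
To put rough numbers on the uop argument, here is a back-of-the-envelope sketch. It is a toy model, not a measurement: the function name `uops_needed`, the element count, and the one-uop-per-scalar-operation assumption are all made up for illustration, and real uop counts depend on the specific instruction and microarchitecture.

```python
import math

def uops_needed(elements, lanes=1, uops_per_instr=1):
    """Toy estimate of uops issued to process `elements` values.

    Hypothetical model: each instruction handles `lanes` values and
    decomposes into `uops_per_instr` uops. Loads, stores, and loop
    overhead are ignored.
    """
    instructions = math.ceil(elements / lanes)
    return instructions * uops_per_instr

n = 1024  # arbitrary element count, chosen only for illustration

scalar = uops_needed(n)                                  # one value per uop
vec_best = uops_needed(n, lanes=4, uops_per_instr=1)     # 4 lanes, 1 uop each
vec_worst = uops_needed(n, lanes=4, uops_per_instr=2)    # 4 lanes, 2 uops each

print(scalar, vec_best, vec_worst)
```

Even in the two-uop case the vector version issues half as many uops as the scalar one for the same work, so under these assumptions the uop and retire queues see less pressure, not more.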