Linking to a different performance issue with non-temporal memory access just seemed to be asking for misunderstandings, and apparently it also led to one. At least I can't see anything suggesting that AotS is compiled with Intel's compiler.
I'm also interested in a source on the false dependency...
I'm not quite sure I can follow. Why wouldn't it be possible to dispatch one complete AVX op (2 uops) and half of the next one (+ 1 = 3 uops) in the same cycle? Assuming fetch/decode is fast enough, there should be enough buffered, and if not, the wider dispatch wouldn't have helped anyway?
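To make the question concrete, here is a toy model of the two cases: a dispatcher that may split a 2-uop op across cycles versus one that may not. The function name, the parameters and the 3-wide dispatch figure are assumptions for illustration only; this does not describe any real core.

```python
def cycles_to_dispatch(num_ops, uops_per_op, width, allow_split):
    """Toy model: cycles needed to dispatch num_ops ops of uops_per_op uops each.

    width       -- uops the dispatcher can send per cycle (assumed value)
    allow_split -- whether the uops of one op may straddle two cycles
    """
    if allow_split:
        # Uops are treated as a flat stream, so every slot can be filled.
        total_uops = num_ops * uops_per_op
        return -(-total_uops // width)  # ceiling division
    # Without splitting, only whole ops fit into a cycle.
    ops_per_cycle = width // uops_per_op
    if ops_per_cycle == 0:
        raise ValueError("one op does not fit into a single dispatch group")
    return -(-num_ops // ops_per_cycle)  # ceiling division


# Six 2-uop ops on a hypothetical 3-wide dispatcher.
split = cycles_to_dispatch(6, 2, 3, allow_split=True)
no_split = cycles_to_dispatch(6, 2, 3, allow_split=False)
print(split, no_split)
```

In this model the third dispatch slot only pays off if ops can be split; otherwise a 3-wide dispatcher behaves like a 2-wide one for 2-uop ops.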
Why would using AVX issue more uops than normal operation? At least on Intel, most AVX instructions decompose into one or maybe two uops while performing four times the work (for 4x32 vectors). For equal throughput, pressure on the uop queue and retire queue would be reduced a lot. The bottleneck would...
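The uop arithmetic behind that argument can be sketched as follows. The element count and the one-uop-per-scalar-op figure are assumptions picked for illustration, not measurements of any particular CPU.

```python
def uops_needed(elements, lanes, uops_per_instr):
    """Uops required to process `elements` values, `lanes` values per instruction."""
    instructions = -(-elements // lanes)  # ceiling division
    return instructions * uops_per_instr


elements = 1024  # hypothetical workload size

scalar = uops_needed(elements, lanes=1, uops_per_instr=1)
vector_1uop = uops_needed(elements, lanes=4, uops_per_instr=1)
vector_2uop = uops_needed(elements, lanes=4, uops_per_instr=2)
print(scalar, vector_1uop, vector_2uop)
```

Even in the two-uop case, the vector version needs half as many uops as the scalar one for the same amount of work.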
Intel also uses statically partitioned buffers, according to chapter 2.6.1 of the Intel 64 and IA-32 Architectures Optimization Reference Manual. AMD actually shares more than Intel, namely the load queue and the ITLB.