Some real nice conspiracy theories here
But the reality is that the MS compiler generates code just like that in 32-bit mode with default switches. I doubt you can blame AT, when pros like Bethesda released Skyrim compiled with disastrous switches (I know because I was involved in the Skyboost project, where the Skyrim exe was patched in memory to replace hot x87 code with SSE2 code and hand-tuned SSE4/AVX code). So there is plenty of code like that around. To avoid it you need to care very much and profile even more, which is not realistic for every project.
Back on the topic of 3DPM: it involves x87 trigonometry instructions. Those are usually the slowest part of any loop that uses them. They are microcoded in the CPU (i.e. broken down into dozens of operations needed to calculate the trigonometric function to the required precision), and that means few out-of-order execution opportunities and lots of scheduling pain inside (because that fcos might be executing some mini loop internally, which needs to be tracked, etc.).
HT is a perfect fit here, because the CPU's execution ports are underutilized and the 2nd thread can make progress, since the core can then keep more of those broken-down microcoded ops in flight...
It really puts AMD in a bad light, because it ends up being a benchmark of how much money was put into x87 microcode quality and scheduling. But it's a real-world benchmark: there are plenty of programs out there executing code like that.
But where performance really matters (like getting results out in time before a scientific paper deadline), no one will use code like that. Intel has some amazing stuff in their libraries that uses SIMD instructions, or a GPU will get used to pump out even more values/s.
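For contrast, here is a rough sketch of the batched style that SIMD code tends to use. This is NOT how Intel's libraries do it; it is just a truncated Taylor series I picked for illustration, with made-up function names, and it assumes the inputs are already reduced to [-pi, pi], where the error stays at roughly 5e-6 or less. What matters is the shape: no branches, no microcoded instructions, only multiplies and adds over an array, which is the kind of loop a compiler can turn into SSE2/AVX code.

```c
#include <stddef.h>

/* Truncated Taylor series for cos(x), evaluated in Horner form on x*x.
   Assumption: x is already in [-pi, pi]. Terms run up to x^14 / 14!. */
static double cos_poly(double x) {
    double x2 = x * x;
    double r = -1.0 / 87178291200.0;   /* -1/14! */
    r = r * x2 + 1.0 / 479001600.0;    /* +1/12! */
    r = r * x2 - 1.0 / 3628800.0;      /* -1/10! */
    r = r * x2 + 1.0 / 40320.0;        /* +1/8!  */
    r = r * x2 - 1.0 / 720.0;          /* -1/6!  */
    r = r * x2 + 1.0 / 24.0;           /* +1/4!  */
    r = r * x2 - 0.5;                  /* -1/2!  */
    r = r * x2 + 1.0;
    return r;
}

/* Apply the approximation to a whole array; each iteration is independent,
   so the loop is a candidate for auto-vectorization. */
void cos_batch(const double *in, double *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = cos_poly(in[i]);
    }
}
```

Whether the loop really gets vectorized depends on the compiler and optimization level, so check the generated code rather than trusting it.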