Isn't that what people want?! I know I want incredibly fast games.
It measures IPC, that is, the number of instructions that get retired per cycle. A stall is nothing: it executes nothing and nothing is being done, so nothing is being retired.
A stall will still show up in CPU utilization because, although nothing has been done, nothing else can be done in its place either. It's a wasted cycle if that was the only thing being executed.
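To put numbers on that: IPC is just retired instructions divided by elapsed cycles, so a stall cycle adds to the denominator without adding anything to the numerator. Here is a minimal sketch. The counter values are made up for illustration and the function name `ipc` is my own, not something taken from PCM.

```python
def ipc(instructions_retired: int, cycles: int) -> float:
    """Instructions per cycle: retired instructions divided by elapsed core cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


# Hypothetical counter readings, chosen only for illustration.
retired = 2_000_000        # instructions retired in total
busy_cycles = 1_000_000    # cycles in which instructions were retired
stall_cycles = 1_000_000   # cycles spent waiting, nothing retired

no_stalls = ipc(retired, busy_cycles)
with_stalls = ipc(retired, busy_cycles + stall_cycles)

print(f"IPC without stalls: {no_stalls:.2f}")
print(f"IPC with stalls:    {with_stalls:.2f}")
```

Same amount of work retired, but the stalled run takes twice the cycles and so reports half the IPC.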
And even if the lower IPC was due to stalls and context switches, how does that make it better? That's even worse, because it proves the code is very badly written.
And that's aside from the question of whether PCM is capable of recognizing thread stalls...
That may be, but the 3D game with low IPC in your example uses a lot of floating point math.
And (I am speculating) that game from 2002 probably uses a lot of integer math, because the CPUs in 2002 were less powerful, and has small routines running from cache to avoid the slow front side bus used at the time.
So the question then becomes how many cycles on average are needed for a given set of integer instructions and for a given set of floating point instructions.
I am willing to bet that simple instructions like integer instructions are ideal for reaching a high IPC. For floating point instructions it is more difficult, because more often than not dependencies arise.
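A rough way to see why dependencies matter more than the instruction type itself is a toy cycle count. This is only a sketch: the latencies (1 cycle for an integer add, 4 cycles for a floating point add) and the issue widths are assumptions for illustration, not figures for any particular CPU, and `cycles_needed` is a hypothetical helper.

```python
def cycles_needed(n_ops: int, latency: int, issue_width: int, dependent: bool) -> int:
    """Toy estimate of the cycles needed for n_ops identical instructions.

    dependent=True:  each op needs the previous result, so latencies add up.
    dependent=False: ops are independent and pipelined, so up to
                     issue_width of them can start every cycle.
    """
    if n_ops == 0:
        return 0
    if dependent:
        return n_ops * latency
    issue_cycles = -(-n_ops // issue_width)  # ceiling division
    # The last group starts in cycle issue_cycles - 1 and finishes latency cycles later.
    return issue_cycles - 1 + latency


N = 1000
# (label, assumed latency, assumed issue width, dependent chain?)
cases = [
    ("int add, independent", 1, 4, False),
    ("int add, dependent",   1, 4, True),
    ("fp add, independent",  4, 2, False),
    ("fp add, dependent",    4, 2, True),
]
for label, latency, width, dependent in cases:
    cycles = cycles_needed(N, latency, width, dependent)
    print(f"{label:22s} {cycles:5d} cycles  IPC = {N / cycles:.2f}")
```

In this model the longer floating point latency hardly hurts as long as the operations are independent. It is the dependent chain that drags the IPC down.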
It then also depends on the measurement being done.
Of course performance counters are used and instructions are counted. But how do they measure the exact stream of instructions?
I am sure that in one of the sources here it is explained.
https://github.com/opcm/pcm
I am very curious.
How does PCM work exactly?
How can a user single out a given set of instructions in a thread?
As a side note, John Carmack was a master at optimizing. If I am not mistaken, he was very good at developing algorithms that mainly used bitwise operations, for example AND, OR, XOR and shifting, to get 3D integer functions that did the same as when done purely in floating point math.
https://en.wikipedia.org/wiki/Fast_inverse_square_root
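For anyone who wants to play with it, here is a sketch of the trick from that article, ported to Python. The original is C and works on 32-bit floats. Here `struct` stands in for the pointer cast, and the final Newton step runs in Python's double precision, so the results will not match the C version bit for bit. The function name `fast_inv_sqrt` is my own.

```python
import struct


def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x within 32-bit float range."""
    # Reinterpret the bits of the 32-bit float as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Integer shift and subtract give a first guess. The magic constant
    # is the one from the published Quake III Arena source.
    i = (0x5F3759DF - (i >> 1)) & 0xFFFFFFFF
    # Reinterpret the integer bits as a float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step in floating point refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


for x in (0.25, 1.0, 2.0, 4.0, 100.0):
    exact = x ** -0.5
    approx = fast_inv_sqrt(x)
    print(f"x={x:7.2f}  approx={approx:.6f}  exact={exact:.6f}")
```

Note that the first guess comes from integer operations on the float's bit pattern, and it is then refined with a few floating point multiplies.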
It is not unlikely that many games during the 2002 timeframe made use of these features.
Now, since there is a lot more brute force and the image quality has gone up enormously, the better accuracy of pure floating point is needed.
And even then, it is all about dependencies and what can run in parallel independently.
You then really have to look at which load, store, integer and floating point instructions can be done at the same time. And this is a bit different for every CPU.