(snip)
These are the instructions for a simple printf program. I'm sure you don't need this lesson, as you seem well informed. Now extrapolate this to a benchmark application.
Analysis of an execution flow:
https://en.wikipedia.org/wiki/Cycles_per_instruction
I like what you write.
I have a good concept of how things move through the pipeline and often count certain instructions in my code to get an idea of how efficient the code is.
The above examples, while great as abstract examples for learning, do not represent real-world code. No one would ever measure a program like hello world; it basically does some setup and then makes a system call.
The other example is loaded with dependencies (some designed to confuse), is in order, and is simplified. I do not believe any modern processor could resolve a dependency in 1 or 2 cycles. I would think it would be more on the order of 7-14.
So here is what I do. In my current project I'm looking at better ways of transposing a matrix of bits for faster output to the GPIO (aka bit banging). It will be open sourced, so I don't mind posting some snippets here.
This is a major routine. It takes a matrix and interleaves the bytes from the middle over and over; the result is a transposed matrix. There are more direct ways of doing this, but this method is very friendly to a pipelined architecture.
Code:
void InterleaveBytes(__m128i *in, __m128i *out, __m128i *scratch,
                     long count, unsigned long passes, unsigned long offset)
{
    __m128i *to, *nextTo;
    __m128i *from = in;

    // Pick starting buffers so the final pass lands in 'out',
    // ping-ponging between 'out' and 'scratch'.
    if (passes-- & 1) { // odd number of passes
        to = scratch;
        nextTo = out;
    } else {
        to = out;
        nextTo = scratch;
    }

    // First pass: interleave bytes from the two halves of the matrix.
    __m128i *end = &to[count];
    do {
        *to++ = _mm_unpacklo_epi8(from[0], from[offset]);
        *to++ = _mm_unpackhi_epi8(from[0], from[offset]);
        from++;
    } while (to < end);
    from = &to[-count];
    to = nextTo;

    // Remaining passes: keep interleaving, swapping source and
    // destination buffers each time.
    do {
        end = &to[count];
        do {
            *to++ = _mm_unpacklo_epi8(from[0], from[offset]);
            *to++ = _mm_unpackhi_epi8(from[0], from[offset]);
            from++;
        } while (to < end);
        end = &from[-(count >> 1)];
        from = &to[-count];
        to = end;
    } while (--passes > 0);
}
So typically this routine would loop 4-5 times on a matrix size of 128 bytes. For loops like this, most of the instructions are superfluous; as long as you feed the major instructions at a decent rate, they will be the determining factor.
On a Haswell at 4 GHz:
Code:
/* 40 + 128 = 168 non-loop operations
   10M passes comes in at 0.48 sec, so 10M / 0.48 = ~21M passes/sec
   4G / 168 = ~24M passes/sec (theoretical ceiling)
   => ~87% efficiency
*/
This first routine accounts for the 40. There is a more complex routine that shifts out the bits; that accounts for the 128.
I now have a metric of how certain instructions are moving through the system. It is ordered, repeatable, and logical.
The same can be said for almost any routine in the abstract. Yes, there could be better terms for what is going on, but IPC is a good term that people understand.
In the pure sense, you are correct.
However, I believe there is room for many contexts for the term IPC.
Edit: The routine for AVX is a bit more complicated because of the 128-bit lanes. Here are the results for it, though. It almost doubles the throughput at a lower efficiency; it has lower IPC. In this case your argument against IPC shows validity.
Code:
/* 24 + 64 = 88 non-loop operations (sans byte bit swap)
   10M passes comes in at 0.3 sec, so 10M / 0.3 = ~33M passes/sec
   4G / 88 = ~45M passes/sec (theoretical ceiling)
   => ~73% efficiency
*/