In the past I wrote a series of articles (in Italian, but Google Translate gives a good enough translation) about x86/x64 instruction statistics. I collected the data with a Python script that disassembled as many instructions as possible (around 1.7 million in the samples I used), so I have some knowledge of the topic.
Take a look here (where I report numbers about the operands used): you'll find that in the public beta of Adobe Photoshop CS6, the number of instructions with only a REG operand decreased drastically in the x64 version compared to the x86 one.
The reason is easily explained by looking at another article (where I report numbers about the mnemonics used): the total number of PUSH and POP instructions is reduced to 1/5, due to the big ABI change and the extensive use of registers for passing function parameters instead of pushing them onto the stack.
The number of instructions referencing stack variables also dropped somewhat, but stayed within the same order of magnitude, because you still need LEA instructions to generate pointers to the referenced variables, and LEA usage increased in the x64 version.
But instructions of that kind are pretty rare. From the second link I provided, you can find this:
Code:
Adobe Photoshop CS6 32 bit (PS32):
Mnemonic   Count      %  Avg sz
MOV       602130  34.48     3.9
PUSH      257768  14.76     1.7
CALL      126675   7.25     4.9
LEA       121033   6.93     4.2
J         110954   6.35     2.9
POP        78536   4.50     1.0
CMP        68943   3.95     3.4
ADD        59819   3.42     3.0

Adobe Photoshop CS6 64 bit (PS64):
Mnemonic   Count      %  Avg sz
MOV       642687  36.99     5.0
LEA       186105  10.71     5.8
J         132638   7.63     3.0
CALL      131855   7.59     5.0
CMP        77335   4.45     4.0
ADD        53417   3.07     4.1
As you can see, the ADD instructions are a bit more than 3% of the total.
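As a quick sanity check on those percentages, here is a minimal Python sketch that back-computes the total instruction count implied by a few rows of the listings above. The row values are copied from the tables; the function name `implied_total` and the choice of rows are only illustrative, not part of the original script.

```python
# Rows copied from the listings above: (mnemonic, count, percent of total).
PS32 = [("MOV", 602130, 34.48), ("PUSH", 257768, 14.76), ("ADD", 59819, 3.42)]
PS64 = [("MOV", 642687, 36.99), ("LEA", 186105, 10.71), ("ADD", 53417, 3.07)]


def implied_total(count, percent):
    """Estimate the total instruction count from one row's count and share."""
    return count / (percent / 100.0)


def implied_totals(rows):
    """Return the implied total for every row of a table."""
    return [implied_total(count, percent) for _, count, percent in rows]


for name, rows in (("PS32", PS32), ("PS64", PS64)):
    # Each row should imply roughly the same total (percentages are rounded).
    print(name, [round(total) for total in implied_totals(rows)])
```

Every row should land close to the same total for its binary, in the neighbourhood of the 1.7 million instructions mentioned at the start; small differences come from the percentages being rounded to two decimals.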
But if you look at the statistics of mnemonics combined with the operands used, the situation is even worse:
Code:
Adobe Photoshop CS6 32 bit (PS32):
Mnemonic + operands    Count      %  Avg sz
PUSH REG              187508  10.74     1.0
CALL PC               116761   6.69     5.0
MOV REG,[REG+DISP]    112738   6.45     3.5
J PC                  110954   6.35     2.9
MOV REG,REG            89188   5.11     2.0
POP REG                78535   4.50     1.0
PUSH IMM               69388   3.97     3.6
LEA REG,[EBP-DISP*8]   62624   3.59     4.2
MOV REG,[ESP+DISP*8]   58744   3.36     5.7
MOV REG,[EBP-DISP*8]   51989   2.98     3.8
MOV [EBP-DISP*8],REG   46840   2.68     3.8
ADD REG,IMM            43279   2.48     3.1

Adobe Photoshop CS6 64 bit (PS64):
Mnemonic + operands    Count      %  Avg sz
MOV REG,REG           136358   7.85     2.9
J PC                  132638   7.63     3.0
CALL PC               121294   6.98     5.0
MOV REG,[RSP+DISP*8]  117040   6.74     7.0
MOV [RSP+DISP*8],REG  108514   6.25     5.8
MOV REG,[REG+DISP]     71937   4.14     4.7
MOV [REG+DISP],REG     56648   3.26     4.7
LEA REG,[RSP+DISP*8]   56141   3.23     6.4
LEA REG,[REG+DISP]     55826   3.21     5.4
POP REG                47358   2.73     1.4
TEST REG,REG           46419   2.67     2.6
LEA REG,[RIP+DISP]     36053   2.08     7.0
MOV REG,IMM            34042   1.96     4.9
XOR REG,REG            33555   1.93     2.5
ADD REG,IMM            31705   1.82     4.4
So, the most common form of the ADD instruction doesn't use a memory operand, but a register (with an immediate).
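To put a number on that, this small Python sketch divides the ADD REG,IMM counts by the total ADD counts from the two listings above. The counts are copied from the tables; the dictionary and function names are just illustrative.

```python
# Counts copied from the two listings above.
ADD_TOTAL = {"PS32": 59819, "PS64": 53417}    # all ADD instructions
ADD_REG_IMM = {"PS32": 43279, "PS64": 31705}  # ADD REG,IMM form only


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


for build in ADD_TOTAL:
    pct = share(ADD_REG_IMM[build], ADD_TOTAL[build])
    print(f"{build}: ADD REG,IMM is {pct:.1f}% of all ADDs")
```

That works out to roughly 72% of the ADDs in PS32 and roughly 59% in PS64. The remainder covers every other form (register-register included), so ADDs with a memory operand can be at most the leftover share.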
I only reported the statistics for the public beta of PS CS6, but I ran the same analysis on many other applications (MySQL, FirebirdSQL, MAME, Write, Crysis 2, etc.), and the situation is more or less the same.
This is what Agner reported for the Ops column:
"Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode."
So it should be the decoder that generates the two macro-ops.
But, of course, if there is any other juicy information, it is welcome.
You're right: the column next to the Ops one was reporting the latency. Sorry for the mistake.
So, I can only confirm that 2 macro-ops are generated for 256-bit AVX instructions.
Another important piece of information reported there is the Reciprocal Throughput, which is usually double for this kind of operation. That's perfectly normal and expected: two macro-ops need more time to be processed by the execution unit than one.
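A tiny Python sketch of what that doubling means in practice, assuming a 256-bit instruction is split into two 128-bit macro-ops and its reciprocal throughput is exactly twice that of the 128-bit form. The cycle values below are hypothetical placeholders, not figures taken from Agner's tables.

```python
def bytes_per_cycle(vector_bytes, reciprocal_throughput):
    """Sustained data rate for back-to-back independent instructions."""
    return vector_bytes / reciprocal_throughput


# Hypothetical reciprocal throughputs, in cycles per instruction.
r128 = 1.0         # assumed value for the 128-bit form
r256 = 2.0 * r128  # assumed doubling for the split 256-bit form

print("128-bit:", bytes_per_cycle(16, r128), "bytes/cycle")
print("256-bit:", bytes_per_cycle(32, r256), "bytes/cycle")
```

Under those assumptions the two forms process the same amount of data per cycle: the 256-bit instruction handles twice the bytes but takes twice as long to issue.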