AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)


bjt2

Senior member
Sep 11, 2016
784
180
86
So more than 33% of mem ops involve the stack and should not occupy an AGU slot. So the AGU requirement is at least 33% lower: 2 AGUs for 3 mem ops are enough, even at peak.
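(A quick check of the arithmetic behind this: at a peak of 3 memory ops per cycle, 3 × (1 − 1/3) = 2 ops per cycle still need an AGU, which exactly matches the two AGUs available.)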
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
So more than 33% of mem ops involve the stack and should not occupy an AGU slot. So the AGU requirement is at least 33% lower: 2 AGUs for 3 mem ops are enough, even at peak.
One thing to consider is that these average values include execution phases with few or no stack accesses, e.g. DGEMM. But we might as well look at sustained AGU occupation rates.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
One thing to consider is that these average values include execution phases with few or no stack accesses, e.g. DGEMM. But we might as well look at sustained AGU occupation rates.
One would only need to know whether the rate of mem operations that require the AGUs is greater than 2 uops per cycle.
For two threads, of course.
Even if load/store instructions are 50% of all instructions, with an IPC that is rarely higher than 2, there is at most 1 instruction per thread per cycle that requires an AGU... so two threads together generate about 2 AGU operations per cycle, right at the limit of the two AGUs.
 

cdimauro

Member
Sep 14, 2016
163
14
61
In these two slides it's written neither uops nor micro-ops but just ops, so one more time you're not even reading accurately the slides you're posting...
I think you need a good pair of glasses, since in BOTH slides it's clearly evident that the Op-Cache sends micro-ops, and NOT ops, to the Micro-op Queue.
As said ad nauseam, that's the case; it's just that the fact that it's possible seems to be highly inconvenient for you. The Hardware.fr journalist states that it's the case, and that it's 2 uops/cycle more than Intel's Haswell's 8 uops/cycle...

http://www.hardware.fr/news/14758/amd-detaille-architecture-zen.html

"Deux de plus que " mean "two more than".....
I'm Italian, so I can also catch some French word, but the point is another one, and it's also clear now that you haven't understood what I've written before, recapping what's the real situation.

The problem is that the slides are completely misleading, since they report wrong information. All three (see my previous post) are completely wrong, so I have no difficulty understanding that the guys at hardware.fr, as well as here at AnandTech, have written incorrect data about what really happens.

Just to recap:
1) the Decoder unit does NOT send "4 instructions/cycle" (first slide) or "instructions" (second and third slide) to the Micro-Op Queue. Instead, it sends 4 micro-ops/cycle to it;
2) the Micro-Op Queue does NOT send "6 uops" to the Dispatcher (first and second slide). Instead, it sends 6 micro-ops to it;
3) the Dispatcher does NOT send "6 Micro-ops" to the INT unit + "4 Micro-ops" to the FP unit (third slide). Instead, it sends 6 ops + 4 ops.

Understood now?
Anyway, as already said, AMD ops are pretty dense and are split AFTER the dispatch, only near the exe units...
If 256-bit ops are also treated as one up to the retire unit, then this is an advantage versus older AMD archs...
On Steamroller ALL 256-bit instructions are decoded into 2 "macro-operations" (according to Agner Fog's instruction tables) and are split across 2 or more (usually) "ports" / "execution units". Even register-to-register instructions use 2 macro-ops. Instructions using a memory read usually need 3 ports, and instructions which do a memory write usually need 4 ports.

Last but not least, the strange thing is that VADD/VSUB/VMUL instructions use 5 or 6 ports. And even stranger, the 128-bit instructions also use 5 or 6 ports.

So, it might be that Zen follows the same path.
I also think that instructions of the type MOV [BX],AX do not require an AGU, because there is nothing to calculate to get the address...
That kind of instruction always uses an AGU on Steamroller.
 
Reactions: Dresdenboy

cdimauro

Member
Sep 14, 2016
163
14
61
No, this is 100% wrong: there are load and store QUEUEs after the AGUs. You can issue 2 loads and a store per cycle, as clearly stated by Michael Clark @ 12:39 into the Hot Chips presentation.
Please, re-read what I've stated: I know it, and I've reported that information.

So, I repeat again: to access the 2L+S units you have to pass through the AGUs, which are TWO.

BTW, AMD talks about TWO Load/Store units:

You are also completely ignoring the stack engine (which is used to do address gen + unknown stuff for stack address/data).
Yes, it works on address generation to save some calculations, and I think also to optimize function calls & returns.

But it has NO access to the memory. Take a look at the other slides.
Also, you keep saying Intel has 4 L/S ports. They do not: they have a dedicated store AGU (it does not move data!), a dedicated store port (it doesn't generate addresses!), and two general-purpose AGUs that can each do a load. So the aggregate is 3!
Again, please re-read what I've written: I've already talked about it! I've already stated that there are 2L+S units! I've already stated that one port is just dedicated to address calculation!
So the question is what percentage of loads, on average, are for the stack, because AMD doesn't use the AGUs to calculate their addresses, so it's completely plausible that Haswell and Zen issue on average the same number of loads and stores per cycle.
AMD also uses the AGUs for address calculations. Take a look at Agner Fog's instruction tables (the link was posted before).
 

cdimauro

Member
Sep 14, 2016
163
14
61
This discussion flows at a faster pace than I have time to add something useful to it.
For me too. I have to go out to dinner with my wife, and right now I have just a few minutes to write something.
However, let me pick something out again.

This setup looks similar to the K12 one, just that the latter has one more AGU to feed the queues.

With the stack engine and stack memfile in mind (which seem to be used to forward stored stack data to subsequent loads, as there is also a dependency table), there might be a reduction of about 20-30% in mem ops needing the AGUs. Further, as AMD's complex ops (COPs in the Construction and Cat cores, also called "compound op" or "complex micro-op" at your employer) can even contain an RMW op, this should save one AGU issue slot. OTOH these instructions just make up some lowish percentage (3%?) of all ops. That discussion took place at SemiAccurate a while back.
The stack engine will surely help save address calculations, but I think it's mostly related to function calls & returns.

Saving memory reads or writes might be possible, but it complicates things a lot, due to memory-disambiguation problems.

See also my other reply below regarding this topic.
For your speculation about possibly reduced AGU pressure for adjacent addresses (e.g. 256b AVX), I haven't found any hints so far. In a few months we might finally know.
I hope so: the lack of information is a bit frustrating. :|
In the end, those 2R might just be needed to provide load data as fast as possible without interference from stores. And the third AGU, in combination with the stack handling stuff, might have been cut due to cycle time (more PRF ports) and/or power efficiency reasons.
It might be, yes.

from "Revisiting Stack Caches for Energy Efficiency", Olson et.al.
Consider that this might happen with IA-32/x86 code, which has only 8 registers, so the stack is used frequently.

But on x64 you have 16 registers, and the ABI has changed as well, putting as much data as possible (parameters on function calls) into registers. This reduces stack operations A LOT (PUSHes, POPs, MOVs from/to [EBP/ESP]).
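To illustrate the ABI difference (a minimal sketch; sum3 is a hypothetical function, and the assembly in the comments is typical compiler output, not taken from any specific binary):
Code:
/* The same call compiled under the two ABIs. */
long sum3(long a, long b, long c) { return a + b + c; }

/* 32-bit cdecl: every argument is a stack memory operation.
       push  ecx          ; arg c
       push  ebx          ; arg b
       push  eax          ; arg a
       call  sum3         ; callee reads [esp+4], [esp+8], [esp+12]

   x64 (System V): the same arguments travel in registers, no stack traffic.
       mov   rdi, rax     ; arg a
       mov   rsi, rbx     ; arg b
       mov   rdx, rcx     ; arg c
       call  sum3
*/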
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,475
1,978
136
So more than 33% of mem ops involve the stack and should not occupy an AGU slot. So the AGU requirement is at least 33% lower: 2 AGUs for 3 mem ops are enough, even at peak.

The main problem with this is that stack and non-stack memory operations are not evenly distributed. Imagine a very simple loop that loads two values, sums them, writes out a value, then moves all the pointers up by one. This loop would have no stack operations.
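In code, that loop looks something like this (a minimal sketch; the function and names are hypothetical). Every memory operation here touches array data, so each one needs an AGU and the stack engine saves nothing:
Code:
/* Two loads, one add, one store, pointer increments: zero stack accesses. */
void sum_arrays(const long *a, const long *b, long *out, long n)
{
    for (long i = 0; i < n; i++)
        out[i] = a[i] + b[i];   /* 2 loads + 1 store per iteration */
}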
 

bjt2

Senior member
Sep 11, 2016
784
180
86
On Steamroller ALL 256-bit instructions are decoded into 2 "macro-operations" (according to Agner Fog's instruction tables) and are split across 2 or more (usually) "ports" / "execution units". Even register-to-register instructions use 2 macro-ops. Instructions using a memory read usually need 3 ports, and instructions which do a memory write usually need 4 ports.

Last but not least, the strange thing is that VADD/VSUB/VMUL instructions use 5 or 6 ports. And even stranger, the 128-bit instructions also use 5 or 6 ports.

So, it might be that Zen follows the same path.

That kind of instruction always uses an AGU on Steamroller.

I remember some of these problems on older archs... I was hoping that Zen corrected some of these minor flaws... If it gets +40% IPC even without correcting them, then when they are corrected, maybe in Zen+, we can expect some further gains...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
For me too. I have to go out to dinner with my wife, and right now I have just a few minutes to write something.

The stack engine will surely help save address calculations, but I think it's mostly related to function calls & returns.

Saving memory reads or writes might be possible, but it complicates things a lot, due to memory-disambiguation problems.

See also my other reply below regarding this topic.

I hope so: the lack of information is a bit frustrating. :|

It might be, yes.

Consider that this might happen with IA-32/x86 code, which has only 8 registers, so the stack is used frequently.

But on x64 you have 16 registers, and the ABI has changed as well, putting as much data as possible (parameters on function calls) into registers. This reduces stack operations A LOT (PUSHes, POPs, MOVs from/to [EBP/ESP]).

If you consider that most code is written in C/C++, and that in 32-bit code all local variables and function parameters are put on the stack (think of complex code with many local variables, maybe small static arrays declared as local variables that can't be kept in a register), you can see that much data lives on the stack. And Dresdenboy posted a graph: about 40% of memory operations, at least in some software, hit the stack... We are talking about push/pop for frame adjustment and local-variable accesses. Frame adjustment is also frequent because of the good programming practice of writing many small subfunctions, instead of a few big ones, for code maintainability... So there are many small functions, each of which requires stack-frame creation and destruction, plus local-variable accesses. Dynamic languages require heap-allocated structures, whose pointers can live in local variables; if the active pointers are many, then 16 registers can be insufficient. Registers can also be occupied as scratchpad in complex expressions, and so on...
This pretty much justifies the roughly 40% figure. And all these stack instructions are taken out of the instruction flow by the stack engine, which produces only occasional synchronization uops, probably mainly when SP or BP is needed by some instruction outside the stack engine...
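As an illustration of the frame traffic described above (a hypothetical leaf function; the assembly in the comments is typical unoptimized 32-bit compiler output, shown only to make the stack ops visible):
Code:
int add_offset(int x, int y)
{
    int t = x + y;       /* prologue:  push ebp / mov ebp, esp / sub esp, 4 */
    return t * 2;        /* args read: mov eax, [ebp+8] / add eax, [ebp+12] */
}                        /* epilogue:  mov esp, ebp / pop ebp / ret         */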
 

bjt2

Senior member
Sep 11, 2016
784
180
86
The main problem with this is that stack and non-stack memory operations are not evenly distributed. Imagine a very simple loop that loads two values, sums them, writes out a value, then moves all the pointers up by one. This loop would have no stack operations.

But the sum would use an RMW operation of the form ADD [mem],AX, for example. This has 1 read and 1 write with the same address, so it uses 1 AGU slot but requires 1 load and 1 store slot.
If you do it with suboptimal code that requires 2 AGU slots, it's your fault...
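The contrast, sketched in code (the x86 assembly in the comments is illustrative, not a real compiler dump):
Code:
/* One read-modify-write: the address is generated once. */
void accumulate(long *p, long x)
{
    *p += x;
    /* RMW form: one AGU slot feeds both the load and the store.
           add  [rdi], rsi
       Split form: the same address is generated twice, wasting an AGU slot.
           mov  rax, [rdi]
           add  rax, rsi
           mov  [rdi], rax
    */
}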
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Just to recap:
1) the Decoder unit does NOT send "4 instructions/cycle" (first slide) or "instructions" (second and third slide) to the Micro-Op Queue. Instead, it sends 4 micro-ops/cycle to it;
2) the Micro-Op Queue does NOT send "6 uops" to the Dispatcher (first and second slide). Instead, it sends 6 micro-ops to it;
3) the Dispatcher does NOT send "6 Micro-ops" to the INT unit + "4 Micro-ops" to the FP unit (third slide). Instead, it sends 6 ops + 4 ops.

Understood now?

On Steamroller ALL 256-bit instructions are decoded in 2 "macro-operations" (according to the Agner Fog's instruction table) and are split in 2 or more (usually) "ports" / "execution units". Even register-to-register instructions use 2 macro-ops. Instructions using a memory read usually need 3 ports, and instructions which do a memory write usually need 4 ports.

Last but not least, the strange thing is that VADD/VSUB/VMULs instructions use 5 or 6 ports. And even more strange thing, is that the 128-bit instructions also use 5 or 6 ports.

Re ops:
I think that needs some deeper research. I remember BD-related diagrams at ascii.jp (by Yusuke Ohara, I suppose) and those by Hiroshige Goto at PC Watch. In both cases the COPs got split up at some later stage (dispatch? which composed dispatch groups).

Re AVX ops on SR:
The order of columns changes sometimes in Agner's tables, IIRC. These numbers look like latencies to me.
 
Reactions: cdimauro

cdimauro

Member
Sep 14, 2016
163
14
61
If you consider that most code is written in C/C++, and that in 32-bit code all local variables and function parameters are put on the stack (think of complex code with many local variables, maybe small static arrays declared as local variables that can't be kept in a register), you can see that much data lives on the stack. And Dresdenboy posted a graph: about 40% of memory operations, at least in some software, hit the stack... We are talking about push/pop for frame adjustment and local-variable accesses. Frame adjustment is also frequent because of the good programming practice of writing many small subfunctions, instead of a few big ones, for code maintainability... So there are many small functions, each of which requires stack-frame creation and destruction, plus local-variable accesses. Dynamic languages require heap-allocated structures, whose pointers can live in local variables; if the active pointers are many, then 16 registers can be insufficient. Registers can also be occupied as scratchpad in complex expressions, and so on...
This pretty much justifies the roughly 40% figure. And all these stack instructions are taken out of the instruction flow by the stack engine, which produces only occasional synchronization uops, probably mainly when SP or BP is needed by some instruction outside the stack engine...
In the past I've written a series of articles (in Italian, but with Google Translate you can get a good enough translation) about x86/x64 statistics, collecting data with a Python script which disassembled as many instructions as possible (around 1.7 million in the samples used), so I have some knowledge of the topic.

Take a look here (where I report numbers about the used operands): you'll find that in the public beta of Adobe Photoshop CS6, the number of instructions with only a REG operand drastically decreased in the x64 version, compared to the x86 one.

The reason is easily explained by looking at another article (where I report numbers about the used mnemonics): the total number of PUSH and POP instructions is reduced to 1/5, due to the big ABI change and the extensive use of registers for passing function parameters, instead of pushing them onto the stack.

The number of instructions referencing stack variables saw some reduction too, but it stayed in the same order of magnitude, because you still use LEA instructions to generate pointers to the referenced variables, and these increased in the x64 version.
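For reference, the counting itself is simple; a minimal sketch (in C with the Capstone disassembler, NOT the original Python script) would be:
Code:
#include <capstone/capstone.h>
#include <stdio.h>

/* Print one mnemonic per line for a code buffer; pipe the output through
   `sort | uniq -c | sort -rn` to get tables like the ones below. */
void count_mnemonics(const uint8_t *buf, size_t len, int is64)
{
    csh handle;
    cs_insn *insn;
    if (cs_open(CS_ARCH_X86, is64 ? CS_MODE_64 : CS_MODE_32, &handle) != CS_ERR_OK)
        return;
    size_t n = cs_disasm(handle, buf, len, 0x1000, 0, &insn);
    for (size_t i = 0; i < n; i++)
        printf("%s\n", insn[i].mnemonic);
    cs_free(insn, n);
    cs_close(&handle);
}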
But the sum would use an RMW operation of the form ADD [mem],AX, for example. This has 1 read and 1 write with the same address, so it uses 1 AGU slot but requires 1 load and 1 store slot.
If you do it with suboptimal code that requires 2 AGU slots, it's your fault...
But that kind of instruction is pretty rare. From the second link I provided, you can see this:
Code:
Adobe Photoshop CS6 32 bit (PS32):
Mnemonic              Count      % Avg sz
MOV                  602130  34.48    3.9
PUSH                 257768  14.76    1.7
CALL                 126675   7.25    4.9
LEA                  121033   6.93    4.2
J                    110954   6.35    2.9
POP                   78536   4.50    1.0
CMP                   68943   3.95    3.4
ADD                   59819   3.42    3.0

Adobe Photoshop CS6 64 bit (PS64):
MOV                  642687  36.99    5.0
LEA                  186105  10.71    5.8
J                    132638   7.63    3.0
CALL                 131855   7.59    5.0
CMP                   77335   4.45    4.0
ADD                   53417   3.07    4.1
As you can see, the ADD instructions are a bit more than 3% of the total.

But if you look at the statistics of mnemonics combined with the used operands, the situation is even worse:
Code:
Adobe Photoshop CS6 32 bit (PS32):
Mnemonic              Count      % Avg sz
PUSH             REG                          187508  10.74    1.0
CALL             PC                           116761   6.69    5.0
MOV              REG,[REG+DISP]               112738   6.45    3.5
J                PC                           110954   6.35    2.9
MOV              REG,REG                       89188   5.11    2.0
POP              REG                           78535   4.50    1.0
PUSH             IMM                           69388   3.97    3.6
LEA              REG,[EBP-DISP*8]              62624   3.59    4.2
MOV              REG,[ESP+DISP*8]              58744   3.36    5.7
MOV              REG,[EBP-DISP*8]              51989   2.98    3.8
MOV              [EBP-DISP*8],REG              46840   2.68    3.8
ADD              REG,IMM                       43279   2.48    3.1

Adobe Photoshop CS6 64 bit (PS64):
MOV              REG,REG                      136358   7.85    2.9
J                PC                           132638   7.63    3.0
CALL             PC                           121294   6.98    5.0
MOV              REG,[RSP+DISP*8]             117040   6.74    7.0
MOV              [RSP+DISP*8],REG             108514   6.25    5.8
MOV              REG,[REG+DISP]                71937   4.14    4.7
MOV              [REG+DISP],REG                56648   3.26    4.7
LEA              REG,[RSP+DISP*8]              56141   3.23    6.4
LEA              REG,[REG+DISP]                55826   3.21    5.4
POP              REG                           47358   2.73    1.4
TEST             REG,REG                       46419   2.67    2.6
LEA              REG,[RIP+DISP]                36053   2.08    7.0
MOV              REG,IMM                       34042   1.96    4.9
XOR              REG,REG                       33555   1.93    2.5
ADD              REG,IMM                       31705   1.82    4.4
So, the most common form of the ADD instruction doesn't use a memory operand, but a register (with an immediate).

I only reported the statistics about the public beta of PS CS6, but I did the same operation with many other applications (MySQL, FirebirdSQL, MAME, Write, Crysis 2, etc.), and the situation is more or less the same.
Re ops:
I think that needs some deeper research. I remember BD related diagrams at ascii.jp (by Yusuke Ohara I suppose) and those by Hiro Goto at PC Watch. In both cases the Cops got split up at some later stage (dispatch? which composed dispatch groups).
This is what Agner reported for the Ops column:
"Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode."
So, it should be the decoder that does the job of generating the two macro-ops.

But, of course, if there is other juicy information, it is welcome.
Re AVX ops on SR:
The order of columns changes sometimes in Agner's tables, IIRC. These numbers look like latencies to me.
You're right: the column next to the Ops one was reporting the latency. Sorry for the mistake.

So, I can only confirm that 2 macro-ops are generated for 256-bit AVX instructions.

Another important piece of information reported there is the reciprocal throughput, which is usually double for this kind of operation; but that's perfectly normal/expected, since they need more time to be processed by the execution unit.
 
Reactions: Dresdenboy

itsmydamnation

Platinum Member
Feb 6, 2011
2,923
3,550
136
Please, re-read what I've stated: I know it, and I've reported that information.

So, I repeat again: to access the 2L+S units you have to pass through the AGUs, which are TWO.
First, reread what I said: there is a load+store queue behind the AGUs, so even if data must pass through the AGUs to enter/leave the PRF, you can still do 2 loads and 1 store per cycle to the cache. The reason for this would be so you can catch up after things like banking conflicts, waiting for loads that are "far away", etc.

Second, I would say all any of us are really doing is educated guessing based on what has been said by AMD and what we know from leaks + compiler patches, etc. Your position is nowhere near as definitive as you present it.

Now, do you have to physically pass through the AGUs for loads and stores? I wouldn't make that assumption as a hard-and-fast rule based on just the lines in the diagram. Does that mean the FPU is magical because it doesn't require AGUs at all (no lines to it)? Reality is much more complex. Remember these are PRF designs, meaning things are virtual and decoupled; what's needed is the load or store address and a data path in and out of the PRF (and obviously to update/retire).

Look at the Execute slide: the LS unit writes directly to the PRF for loads; only stores travel via the AGUs. You can also kind of glean this behavior in Bulldozer, where MOV r,r only requires EX, MOV r,m requires only an AGU (address + data out), but MOV m,r requires both AG and EX (the AGU to generate the address, EX's PRF ports to enter it into the PRF). I haven't checked this for Zen; I assume it should be in the compiler patch somewhere.

So at a minimum I wouldn't assume stack-engine loads have to touch the AGUs; stack-engine stores might, if that is the only data path out of the PRF. But after listening to Michael Clark again just now: does the stack engine even handle stores, and does it only handle stack addresses? Listening to just that section explicitly, it sounds like it can handle any incrementing (push) memory calculation and push it to the load queue.

BTW, AMD talks about TWO Load/Store units:
It also states in the exact same diagram 2 loads + 1 store per cycle; not 1+1 or 2+0 but 2+1!

If you're going to take diagrams as 100% factual (instead of a simplification to give an idea of what happens where and how data flows through a core), you can't pick and choose what you like.

Yes, it works on address generation to save some calculations, and I think also to optimize function calls & returns.

But it has NO access to the memory. Take a look at the other slides.
You don't know that, because the stack engine is never represented anywhere. All Michael Clark says is that it sits up in the dispatch stage and is used to, and I quote, "We know very early for things that look serially dependent, push after push after push, we know all the addresses immediately and to decouple them from each other and be able to load the registers". This also aligns with the AGUs not being used to physically move data from the cache to the PRF. So it would make sense for the stack engine to be able to push to the load queue.

AMD also uses the AGUs for address calculations. Take a look at Agner Fog's instruction tables (the link was posted before).

I've studied these many times over the years, plus stuff from Hiroshige Goto, Dresdenboy, StuffedCow and many more I have forgotten; the end result is that nothing is as simple as the diagrams make out.
 
Reactions: Dresdenboy

NostaSeronx

Diamond Member
Sep 18, 2011
3,706
1,233
136
You don't know that, because the stack engine is never represented anywhere.
pg 9 / "Decode - AMD and the new “Zen” High Performance x86 Core at Hot Chips 28"

Micro-op Queue
-> Microcode ROM
-> Stack Engine Memfile
=> Dispatch

The Stack Engine sits between the Micro-op Queue & Dispatch, but is primarily controlled by the ALUs, as if it were an L0d cache attached to the PRF.

The AGUs physically push or pop registers from the Load or Store queues.
 

cdimauro

Member
Sep 14, 2016
163
14
61
First, reread what I said: there is a load+store queue behind the AGUs, so even if data must pass through the AGUs to enter/leave the PRF, you can still do 2 loads and 1 store per cycle to the cache. The reason for this would be so you can catch up after things like banking conflicts, waiting for loads that are "far away", etc.
I... already... know... it!

OK, since we are stalled here, let me make an example. Suppose that you have 3 completely independent instructions to be executed:
Code:
MOV    EAX,[EBX]
MOV    ECX,[ESI]
MOV    [EDI],EDX
and all instruction queues are empty (ideal condition: "I'm the king of the world!"). As you can see, they make 2 loads and 1 store, which in theory should be a perfect match for Zen, according to what you said.

Now think about the Dispatcher holding these 3 instructions (micro-ops, in reality), all ready to be executed. But they require a memory access, which means... using an AGU.

Question: what will it do now? With only two AGUs, at most two of the three addresses can be generated in that cycle; the third micro-op has to wait.
Second, I would say all any of us are really doing is educated guessing based on what has been said by AMD and what we know from leaks + compiler patches, etc. Your position is nowhere near as definitive as you present it.
Well, I never claimed to have The Truth in my pocket. Like anyone here, I'm lacking a lot of information, so we are just talking about what's available, and trying to figure out / guess something.
Now, do you have to physically pass through the AGUs for loads and stores? I wouldn't make that assumption as a hard-and-fast rule based on just the lines in the diagram.
If you don't pass through the AGUs, then you can have memory disambiguation problems, as you should know.
Does that mean the FPU is magical because it doesn't require AGUs at all (no lines to it)? Reality is much more complex. Remember these are PRF designs, meaning things are virtual and decoupled; what's needed is the load or store address and a data path in and out of the PRF (and obviously to update/retire).
Sure. Nothing against this, and as I said, we lack a lot of information here.

However, you can still find some data which is useful for understanding what might happen.

For example, regarding the FPU: at first glance at the bigger pictures there seem to be no connections to the load/store queues/units, but then you look at slide 12 (Load/Store and L2) and you see that the L1D sends data to it. Which means that FPU operations that need to access memory have to pass through the AGUs, as is logical / expected.
Look at the Execute slide: the LS unit writes directly to the PRF for loads; only stores travel via the AGUs.
Yes, I saw. That's really strange.
So at a minimum I wouldn't assume stack-engine loads have to touch the AGUs; stack-engine stores might, if that is the only data path out of the PRF. But after listening to Michael Clark again just now: does the stack engine even handle stores, and does it only handle stack addresses? Listening to just that section explicitly, it sounds like it can handle any incrementing (push) memory calculation and push it to the load queue.
I find it reasonable.
It also states in the exact same diagram 2 loads + 1 store per cycle; not 1+1 or 2+0 but 2+1!
I think this is another piece of misleading information in the slides. I suppose that when they talk about the 2 Load/Store Units, they are referring to the 2 Load/Store Queues.

Whereas when they talk about the 2 loads and 1 store, they are referring to the units which do the real job.
If you're going to take diagrams as 100% factual (instead of a simplification to give an idea of what happens where and how data flows through a core), you can't pick and choose what you like.
Well, the main problem here is that the slides report a lot of wrong data, which confuses things.

But this doesn't stop us from extracting some information by thinking about how things should work. An example is the 2 AGUs, which limit the number of load/store operations dispatched per cycle to the proper load/store unit.
You don't know that, because the stack engine is never represented anywhere. All Michael Clark says is that it sits up in the dispatch stage and is used to, and I quote, "We know very early for things that look serially dependent, push after push after push, we know all the addresses immediately and to decouple them from each other and be able to load the registers". This also aligns with the AGUs not being used to physically move data from the cache to the PRF. So it would make sense for the stack engine to be able to push to the load queue.
NostaSeronx has already replied to this.
I've studied these many times over the years, plus stuff from Hiroshige Goto, Dresdenboy, StuffedCow and many more I have forgotten; the end result is that nothing is as simple as the diagrams make out.
Nothing is simple. But if you focus on some details, you can find useful information that helps in guessing what happens in some scenarios.

The 2 macro-ops sent by the decoder for 256-bit AVX instructions are a very interesting example, since they limit the decoder throughput (which can send a maximum of 4 macro-ops per cycle).

If Zen follows the same scheme, then I really want to see how it behaves with code that uses such instructions. That isn't so common yet, but with the PS4 and Xbox One based on Jaguar cores which support 256-bit AVX instructions, and games being ported to the PC as well, it might represent a bottleneck for this CPU.
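Putting a number on that concern (assuming the Steamroller-style 2-macro-op split carries over to Zen): with a 4-macro-op/cycle decoder, 4 / 2 = 2 256-bit AVX instructions can be decoded per cycle, i.e. half the nominal decode width.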
 

bjt2

Senior member
Sep 11, 2016
784
180
86
In the past I've written a series of articles (in Italian, but with Google Translate you can get a good enough translation) about x86/x64 statistics, collecting data with a Python script which disassembled as many instructions as possible (around 1.7 million in the samples used), so I have some knowledge of the topic.

Take a look here (where I report numbers about the used operands): you'll find that in the public beta of Adobe Photoshop CS6, the number of instructions with only a REG operand drastically decreased in the x64 version, compared to the x86 one.

The reason is easily explained by looking at another article (where I report numbers about the used mnemonics): the total number of PUSH and POP instructions is reduced to 1/5, due to the big ABI change and the extensive use of registers for passing function parameters, instead of pushing them onto the stack.

The number of instructions referencing stack variables saw some reduction too, but it stayed in the same order of magnitude, because you still use LEA instructions to generate pointers to the referenced variables, and these increased in the x64 version.

But that kind of instruction is pretty rare. From the second link I provided, you can see this:
Code:
Adobe Photoshop CS6 32 bit (PS32):
Mnemonic              Count      % Avg sz
MOV                  602130  34.48    3.9
PUSH                 257768  14.76    1.7
CALL                 126675   7.25    4.9
LEA                  121033   6.93    4.2
J                    110954   6.35    2.9
POP                   78536   4.50    1.0
CMP                   68943   3.95    3.4
ADD                   59819   3.42    3.0

Adobe Photoshop CS6 64 bit (PS64):
MOV                  642687  36.99    5.0
LEA                  186105  10.71    5.8
J                    132638   7.63    3.0
CALL                 131855   7.59    5.0
CMP                   77335   4.45    4.0
ADD                   53417   3.07    4.1
As you can see, the ADD instructions are a bit more than 3% of the total.

But if you look at the statistics of mnemonics combined with the used operands, the situation is even worse:
Code:
Adobe Photoshop CS6 32 bit (PS32):
Mnemonic              Count      % Avg sz
PUSH             REG                          187508  10.74    1.0
CALL             PC                           116761   6.69    5.0
MOV              REG,[REG+DISP]               112738   6.45    3.5
J                PC                           110954   6.35    2.9
MOV              REG,REG                       89188   5.11    2.0
POP              REG                           78535   4.50    1.0
PUSH             IMM                           69388   3.97    3.6
LEA              REG,[EBP-DISP*8]              62624   3.59    4.2
MOV              REG,[ESP+DISP*8]              58744   3.36    5.7
MOV              REG,[EBP-DISP*8]              51989   2.98    3.8
MOV              [EBP-DISP*8],REG              46840   2.68    3.8
ADD              REG,IMM                       43279   2.48    3.1

Adobe Photoshop CS6 64 bit (PS64):
MOV              REG,REG                      136358   7.85    2.9
J                PC                           132638   7.63    3.0
CALL             PC                           121294   6.98    5.0
MOV              REG,[RSP+DISP*8]             117040   6.74    7.0
MOV              [RSP+DISP*8],REG             108514   6.25    5.8
MOV              REG,[REG+DISP]                71937   4.14    4.7
MOV              [REG+DISP],REG                56648   3.26    4.7
LEA              REG,[RSP+DISP*8]              56141   3.23    6.4
LEA              REG,[REG+DISP]                55826   3.21    5.4
POP              REG                           47358   2.73    1.4
TEST             REG,REG                       46419   2.67    2.6
LEA              REG,[RIP+DISP]                36053   2.08    7.0
MOV              REG,IMM                       34042   1.96    4.9
XOR              REG,REG                       33555   1.93    2.5
ADD              REG,IMM                       31705   1.82    4.4
So, the most common form of the ADD instruction doesn't use a memory operand, but a register (with an immediate).

I only reported the statistics about the public beta of PS CS6, but I did the same operation with many other applications (MySQL, FirebirdSQL, MAME, Write, Crysis 2, etc.), and the situation is more or less the same.

This is what Agner reported for the Ops column:
"Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode."
So, it should be the decoder that does the job of generating the two macro-ops.

But, of course, if there is other juicy information, it is welcome.

You're right: the column next to the Ops one was reporting the latency. Sorry for the mistake.

So, I can only confirm that 2 macro-ops are generated for 256-bit AVX instructions.

Another important piece of information reported there is the reciprocal throughput, which is usually double for this kind of operation; but that's perfectly normal/expected, since they need more time to be processed by the execution unit.

Cesare, I am bjt2 from hwupgrade.it, so we know each other, and I know these articles: I eagerly read them at the time they were published...

They are STATIC counts... If the stack accesses are in a big loop, the DYNAMIC count is much higher... Otherwise the graphs posted by Dresdenboy would be unexplainable... Why do they show 40% stack accesses? Because the dynamic count is very different from the static count. With a static count you give the same weight to code executed once and to code executed maybe millions of times in a big loop...

This is a C function that calculates the map of local means over a bunch of 3D images...
Look at the division near the end... With a static count you would say that this code has just ONE floating-point division... But with 8 256x256x128 images (about 67 million voxels), that division actually executes tens of millions of times...



Code:
// function that computes the map of local means
void MeanVar(double *img, int d_x, int d_y, int d_z, int sx, int sy, int sz, double *means, int num_c)
{
    int i, j, k, x, y, z, m;
    double val, count;

    count = (double) ((2*d_x + 1)*(2*d_y + 1)*(2*d_z + 1));

    for (i = d_x; i < (sx - d_x); i++)
        for (j = d_y; j < (sy - d_y); j++)
            for (k = d_z; k < (sz - d_z); k++)
                for (m = 0; m < num_c; m++) {

                    val = 0.;
                    for (z = -d_z; z <= d_z; z++)
                        for (y = -d_y; y <= d_y; y++)
                            for (x = -d_x; x <= d_x; x++)
                                val += img[x+i + sx*(y+j + sy*(z+k + sz*m))];

                    means[i + sx*(j + sy*(k + sz*m))] = val/count;
                }

    return;
}




EDIT: this is part of the code of this paper: Link
 
Last edited:

cdimauro

Member
Sep 14, 2016
163
14
61
I know, Marco, but may I suggest that you disassemble that function as both 32- and 64-bit code, and see how the stack is used in the two cases?

P.S. Of course, the inner loops are the most important.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
I know, Marco, but may I suggest that you disassemble that function as both 32- and 64-bit code, and see how the stack is used in the two cases?

P.S. Of course, the inner loops are the most important.

This code actually ran on four NVIDIA GPUs, each of which has more than 16 registers (if I remember well, the GTX 690 we used has at least 128 SIMD registers, named R0...R127 in the disassembled PTX code, in the CUs)... We managed to go from 25 to 7 assembler instructions per inner iteration with some tricks... The performance did not change, because we were limited by L1 cache bandwidth... I know that because we tried to overclock/underclock the RAM and the CUs separately, and only the CU clock correlated with performance... The RAM clock did nothing from 4GT/s to 6GT/s... Anyway, we kept the 7-instruction code to use as few FPUs as possible...

Parameters plus local variables number 18. Keep some registers for scratch and you can see that some stack operations would be needed even in 64-bit code... We obtained a speedup of 100 going from 1-thread CPU code on a 4GHz CPU to GPU code on 6144 CUs at 900MHz... So even the CPU code was slowed down a bit by something...


This is a small fragment of the disassembled GPU code:

[...]

ld.param.u64 %rd2, [__cudaparm__Z14gpu_nlm_filterPfS_S_fffiiiiiiiiiifiii_d_img];
.loc 15 368 0
add.u64 %rd5, %rd2, %rd4;
add.s32 %r89, %r52, %r68;
add.s32 %r90, %r84, %r85;
add.s32 %r91, %r89, %r90;
cvt.s64.s32 %rd6, %r91;
mul.wide.s32 %rd7, %r91, 4;
add.u64 %rd8, %rd2, %rd7;
mov.s32 %r92, %r72;
$Lt_0_18690:
//<loop> Loop body line 368, nesting depth: 7, estimated iterations: unknown
.loc 15 370 0
ld.global.f32 %f8, [%rd5+0];
.loc 15 371 0

[...]

So at least 92 registers... Probably 128...



This is to say that an optimizing compiler might trade register pressure for speed...
 
Last edited:

cdimauro

Member
Sep 14, 2016
163
14
61
Yes, but it's important to see how much the stack is used on x86 & x64, which was the topic we were talking about.

The huge register usage in your case is due to the large number of SIMD registers on the GPU, so you cannot expect this part of the code to require 64 registers on x86 or x64 (though in that case more could be used).

In short: I doubt that this code makes massive use of the stack at runtime on x64.
 

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
I think this is another piece of misleading information in the slides. I suppose that when they talk about the 2 Load/Store Units, they are referring to the 2 Load/Store Queues.

It's written 2 loads + 1 store per cycle...

The only thing misleading here is your insistence on making up data that does not exist; worse, data that is contradicted by the very slides you're posting...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Yes, but it's important to see how much the stack is used on x86 & x64, which was the topic we were talking about.

The huge register usage in your case is due to the large number of SIMD registers on the GPU, so you cannot expect this part of the code to require 64 registers on x86 or x64 (though in that case more could be used).

In short: I doubt that this code makes massive use of the stack at runtime on x64.

I don't know how to perform dynamic analysis (I have never used a debugger... at most I glanced at the assembler sources, as you can see)... But the analysis Dresdenboy posted clearly demonstrates that 40% stack activity is possible... And indeed, since code that uses more than 2 memory uops/cycle is probably RAM-bandwidth starved, the bottleneck will not be the AGUs...
Think of it: heavy HPC code probably works on data that doesn't fit in the L1 cache... probably not in L2 or L3 either... So it must rely on RAM bandwidth... Moreover, with an IPC around 2.5, we can expect that most instructions are not memory accesses, due to RAM bandwidth limitations... Limited-IPC code is code with many jumps or memory accesses, because memory is often a bottleneck... Rarely does a general-purpose program have an IPC higher than 1.5... And with 50% of instructions being memory accesses, even two threads of this kind cannot saturate 2 AGUs... More AGUs can process a peak faster, but in steady state fewer than 2 AGU operations/cycle are needed, even for two heavy HPC threads...
HPC software with an IPC above 2 is software with heavy, complex calculations per datum. A convolution-style image filter like the one I posted is bandwidth starved, because it does at most a simple add and mul per datum... too few calculations for each memory access...

The pattern would be:
- LOAD the initial value (0)
- ACCUMULATE: load two data from memory (the addresses must be calculated from the indices, so integer arithmetic and scratch registers must be used), multiply the two numbers and add to the accumulator
- At the end of n iterations of accumulation, divide by the number of iterations
- Repeat for each pixel.

Apart from the final division, you can see that for each memory access an FMAC is needed.
This is the worst case for Zen: whether you implement it as one FMAC or as an FMUL plus an FADD, for each pair of memory accesses you need one FMAC. So you cut the throughput in HALF, because you have neither 4 AGUs nor 4 load ports.
But this at least requires an index increment, supposing a one-dimensional vector is used...
As you can see in the code I posted, though, we often have 2D or 3D images.
So we have 2 or 3 indices to increment, and in the case of a convolution they are 4 or 6, as you can see in the code I gave you...
Moreover, on each iteration I must calculate the offsets with 2 or 3 INT ADDs and 2 or 3 MULs.
So each iteration I must do many operations before I can even load the two data...
With outside precalculation of the offsets (like I said before; see the sketch at the end of this post), we can reduce this to a few operations, plus the load and the FMAC, per iteration: one index increment, a memory access (with the precalculated index) and an FMAC. 7 operations per iteration were obtained in other code, not posted, that does more complex calculations than a single FMAC. But if you keep the code simple, you need at least 3 ADDs and 3 MULs just to calculate the offsets, and only then can you load the two data...

Only with one-dimensional data do you need 2 indices, and so 3 arithmetic instructions (1 FMAC) and 2 memory accesses...

EDIT: obviously I forgot the 6 CMPs and the 6 conditional jumps... They too are executed in turn... 1 fused uop per cycle, plus others once in a while (when an inner loop finishes).
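A minimal sketch of that precalculated-offset inner loop (hypothetical code, not the actual source from the paper):
Code:
/* The flat offsets of the convolution window are computed once, so the
   innermost loop is just one increment, one load and one accumulate. */
double window_mean(const double *img, const int *offsets, int n, double count)
{
    double val = 0.;
    for (int t = 0; t < n; t++)      /* 1 index increment + 1 fused CMP/jump */
        val += img[offsets[t]];      /* 1 load + 1 FADD (an FMAC if weighted) */
    return val / count;              /* the single division, once per pixel  */
}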
 
Last edited:

cdimauro

Member
Sep 14, 2016
163
14
61
It's written 2 loads + 1 store per cycle...

The only thing misleading here is your insistence on making up data that does not exist; worse, data that is contradicted by the very slides you're posting...
Here's the slide:

Which reports (on the right side):
"2 Load/Store units"

So, you continue to have problems not only reading what other people post, but also looking at a simple slide...
 

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
Here's the slide:

Which reports (on the right side):
"2 Load/Store units"

So, you continue to have problems not only reading what other people post, but also looking at a simple slide...

You confirmed that you are here just to troll, because in the bottom left corner it's explicitly written "2 loads + 1 store per cycle"...

This is of course in reference to the L/S capability, but I guess that everyone here, apart from you, noticed this a long time ago...
 

cdimauro

Member
Sep 14, 2016
163
14
61
You confirmed that you are here just to troll,
And here's the usual personal attack, after I proved that you need not one but two good pairs of glasses.
because in the bottom left corner it's explicitly written "2 loads + 1 store per cycle"...
Tell me where I have denied this.
This is of course in reference to the L/S capability, but I guess that everyone here, apart from you, noticed this a long time ago...
Guess what: we were "just" discussing the L/S capability. IF and only IF you had taken the trouble to read and understand what was written...
 