AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)


Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
You confirmed that you are here just to troll, because in the bottom left corner it's explicitly written "2 loads + 1 store per cycle"...

This is of course in reference to the L/S capability, but I guess that everyone here, apart from you, noticed this a long time ago..
He's referring (not "red herring" ) to the right side citing the units while you point at the lower left showing the capabilities of the queues + L1D$.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
But if the L/S capability were in any case limited by the 2 AGUs, why bother putting 3 ports on the L1D and slowing it down? Just to fit the peak? I have already shown you that even with 2 heavy threads, approaching 2 AGU ops/cycle is very difficult, even without considering stack accesses. 2 AGU ops/cycle are needed for low-computation/high-load-store algorithms... but those are limited by RAM bandwidth, because there is no simple algorithm of that kind with a low memory footprint, other than perhaps some specific synthetic benchmark aimed at measuring exactly this bottleneck...
The simplest memory-hungry algorithm is the summation of two linear vectors, with the result in a third (if it's done in place in one of the two vectors, the add becomes an RMW instruction and we spare an AGU op).

We have:

start:
MOV AX,[mem1+indexreg]
MOV BX,[mem2+indexreg]
ADD AX,BX
MOV [mem3+indexreg],AX
INC indexreg
CMP indexreg,limit
JNZ start

For each iteration we have 2 loads, 1 store, 3 AGU ops, 2 ALU ops and a fused CMP+JMP.
This needs at least 3 cycles per iteration on Zen CPUs, but soon there are cache misses... so this is limited by RAM bandwidth...

If one of the two inputs can be overwritten, we have:

start:
MOV AX,[mem1+indexreg]
ADD [mem2+indexreg],AX
INC indexreg
CMP indexreg,limit
JNZ start

Here we have 2 loads, 1 store, 2 AGU ops, 2 ALU ops and a fused CMP+JMP per iteration... Again we are limited by memory bandwidth, even if the very first iterations, served from the cache, run at 2 cycles/iteration...
Simply put, even the smartest prefetcher can't sustain more than an iteration every few dozen clock cycles...


EDIT: and anyway, by using more registers you can unroll the loop by three in the first case and by two in the second, and obtain an equivalent 1-cycle loop (exploiting the OOO nature of the CPU) when not bandwidth starved, i.e. only during the very first iterations, and only if the data are in cache, which is not guaranteed...
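A minimal sketch of what the two-way unrolling of the second loop could look like (same loose pseudo-assembly as above; the second register and the +1 element offsets are just placeholders, and the cycle estimate assumes the slide's 2 loads + 1 store per cycle):
Code:
start:
MOV AX,[mem1+indexreg]        ; iteration i:   1 load
ADD [mem2+indexreg],AX        ; iteration i:   1 load + 1 store (RMW)
MOV BX,[mem1+indexreg+1]      ; iteration i+1: 1 load, independent chain via a second register
ADD [mem2+indexreg+1],BX      ; iteration i+1: 1 load + 1 store (RMW)
ADD indexreg,2                ; one index update covers both iterations
CMP indexreg,limit
JNZ start                     ; 4 loads + 2 stores per pass -> about 2 cycles at 2 loads +
                              ; 1 store per cycle, i.e. ~1 cycle per original iteration from L1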
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Yes, but it's important to see how much stack is used on x86 & x64, which was the point we were discussing.

The huge register usage in your case is due to the large number of SIMD registers of the GPU, so you cannot expect this part of the code to require 64 registers on x86 or x64 (though in the latter case more registers can be used).

In short: I doubt that this code makes massive use of the stack at runtime on x64.
I think one point is the amount of local variables/structs used, as these would also land on the stack, plus the contents of each register that needs to be kept for later. The ABI and twice the registers in 64-bit help, of course.

One thing to add about the stack engine: Mike explicitly talked about early store-to-load forwarding, as stack ops usually go PUSH -> POP, not the load first. So does this have to go to the cache? Only if some dependency is detected, which is also a job of the stack engine.
 

Abwx

Lifer
Apr 2, 2011
11,172
3,869
136
He's referring (not "red herring" ) to the right side citing the units while you point at the lower left showing the capabilities of the queues + L1D$.

Not at all, he's referring to the capability per cycle; this has already been pointed out by a member who told him that:

It also states in the exact same diagram 2 loads + 1 store per cycle, not 1+1 or 2+0, but 2+1!

To which he answered:

I think this is another piece of misleading information spread in the slides. I suppose that when they talk about the 2 Load Store Units, they are referring to the 2 Load Store Queues.

Whereas when they talk about the 2 loads and 1 store, they are referring to the units which do the real job.

Tell me where I have denied this.

Guess what: we were "just" discussing the L/S capability. IF and only IF you had the pleasure of reading and understanding what was written...

Your denial is in the quotes I made above, all of which can be found here, in one of your posts:

http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/first-summit-ridge-zen-benchmarks.2482739/page-56#post-38509839

As for discussing the LSU capabilities so much, are you sure that it's not rather about ignoring its capabilities..?
 

bjt2

Senior member
Sep 11, 2016
784
180
86
I think one point is the amount of local variables/structs used, as these would also land on the stack, plus the contents of each register that needs to be kept for later. The ABI and twice the registers in 64-bit help, of course.

One thing to add about the stack engine: Mike explicitly talked about early store-to-load forwarding, as stack ops usually go PUSH -> POP, not the load first. So does this have to go to the cache? Only if some dependency is detected, which is also a job of the stack engine.
Compatibility with the ISA should be guaranteed. Stacks are private to each thread and usually are never accessed by other CPUs or DMA devices, but, even if there is no rush, I think these data should ultimately be written to RAM, or at least to the caches...

EDIT: the stack engine must propagate the data in one specific case: if there are instructions in the flow that are not managed by the stack engine, and so not removed from it, which use the stack data via direct addressing, such as MOV AX,[SP+BP] etc...

EDIT: I would also like to know how the aliasing problem is resolved: I can access the stack through another register, or with another offset... How is this managed?
 
Last edited:

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Not at all, he's referring to the capability per cycle; this has already been pointed out by a member who told him that
Yeah, I've seen that reply. Interesting: AMD only mentions the load capacity on the right side.

Compatibility with the ISA should be guaranteed. Stacks are private to each thread and usually are never accessed by other CPUs or DMA devices, but, even if there is no rush, I think these data should ultimately be written to RAM, or at least to the caches...

EDIT: the stack engine must propagate the data in one specific case: if there are instructions in the flow that are not managed by the stack engine, and so not removed from it, which use the stack data via direct addressing, such as MOV AX,[SP+BP] etc...

EDIT: I would also like to know how the aliasing problem is resolved: I can access the stack through another register, or with another offset... How is this managed?
Somewhere I have some big traces... These might help.

These instructions with direct addressing of stack ranges are the reason for the dependency checks. And if the data written to the stack needs to go to the RAM, these stores might be done at a lower priority in a deferred way.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Yeah, I've seen that reply. Interesting: AMD only mentions the load capacity on the right side.


Somewhere I have some big traces... These might help.

These instructions with direct addressing of stack ranges are the reason for the dependency checks. And if the data written to the stack needs to go to the RAM, these stores might be done at a lower priority in a deferred way.

So the stack engine should examine ALL memory instructions to see if they access the stack? Should it do all the AGU calculations? What about indirect addressing? This can be very slow... I think that checkpointing was introduced for this reason... If at the retire stage I detect a conflict between a memory operation and a stack operation on the same physical address, this has to be handled...

Let's make an example: suppose we PUSH (write) something onto the stack, and a subsequent instruction with a complex addressing mode not using SP, trying to load the same address, is directed to the normal path: if the L/S units are not connected to the memfile, the instruction could see a stale value... So the memfile acts as an L0D cache... Obviously the opposite also holds: the memfile should also check the L1D cache, with an already-calculated address, without using the standard AGUs...
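A minimal sketch of the hazard just described (x86-64; the registers and offset are made up for illustration, and whether the memfile really resolves it this way is exactly the open question):
Code:
push rax                  ; store handled via RSP -> candidate for the stack engine / memfile
mov  rcx, [rbp+rdx-32]    ; complex addressing, no RSP: goes down the normal AGU/LSU path
                          ; if rbp+rdx-32 happens to be the slot just pushed to, this load
                          ; must still return the pushed value, so the memfile and the
                          ; load/store unit have to be cross-checked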
 
Mar 10, 2006
11,715
2,012
126
We know from the AoTS leak that the 8C/16T Summit part clocks at 3.2GHz max single core turbo, 2.8GHz base, 3.05GHz all core turbo (hey, was that Summit Ridge in the Blender demo running at 3GHz or 3.05GHz?).

Let's forget Bulldozer for a minute and rewind to Phenom II X6. This thing came with six cores, ran at 3.3GHz base and 3.7GHz max single core turbo (http://www.newegg.com/Product/Product.aspx?Item=N82E16819103913).

According to THG, the Phenom II X6 was able to be OC'd to 4GHz (http://www.tomshardware.com/reviews/phenom-ii-x6-1100t-thuban-amd,2810-8.html).

If we assume similar levels of overclockability for Summit Ridge, we could be looking at headroom of around 10% on all 8 cores, or roughly ~3.52GHz.

If we assume overclockability more like the speed-demon Kaveri (which could go from 4.1GHz max single core to 4.7GHz, or about 15%), then we could see Summit Ridge be pushed to ~3.7GHz.
 
Last edited:

cdimauro

Member
Sep 14, 2016
163
14
61
I don't know how to perform dynamic analysis (I never used a debugger... at most I glanced at assembler sources, as you can see)...
That's enough, for small pieces of code.
But the analysis Dresdenboy posted clearly demonstrates that 40% stack activity is possible... And indeed, since code that uses more than 2 memory uops/cycle is probably RAM-bandwidth starved, the bottleneck will not be the AGUs...
The analysis has a lot of limitations, if you read the paper.
Think of it: heavy HPC code probably works on data that doesn't fit in the L1 cache... probably not in L2 or L3 either... so it must rely on RAM bandwidth... Moreover, with an IPC around 2.5, we can expect that most instructions are not memory accesses, due to RAM bandwidth limitations... Low-IPC code is code with many jumps or memory accesses, because memory is often a bottleneck... Rarely does a general-purpose program have an IPC higher than 1.5... And with 50% of instructions being memory accesses, even two threads of this kind cannot saturate 2 AGUs... More AGUs can absorb a peak faster, but in steady state fewer than 2 AGU operations/cycle are needed, even for two heavy HPC threads...
So, fewer AGUs are better? If they are rarely used, there's no need to put in 3 AGUs. And not even 2 load + 1 store units. Right? So you can save transistors and power, and have a simpler design, right?
HPC software with an IPC above 2 is software with heavy and complex calculations for each datum. A convolution-like image filter, like the one I posted, is bandwidth starved, because it does at most a simple add and mul for each datum... Too few calculations for each memory access...

The pattern would be:
- LOAD initial value (0)
- ACCUMULATE: load two data from memory; the addresses must be calculated from the indices, so integer arithmetic and scratch registers must be used; multiply the two numbers and add to the accumulator
- At the end of n accumulation steps, divide by the number of steps
- Repeat for each pixel.

Apart from the final division, you can see that for each memory access an FMAC is needed.
This is the worst case for Zen. Whether you implement it as an FMAC or as an FMUL and an FADD, for each pair of memory accesses you need one FMAC. So you cut the throughput in HALF, because you have neither 4 AGUs nor 4 load ports.
But this at least requires an index increment, supposing a one-dimensional vector is used...
But as you can see in the code I posted, we often have 2D or 3D images.
So we have 2 or 3 indices to increment. And in the case of a convolution they are 4 or 6, as you can see in the code I gave you...
Moreover, for each iteration I must calculate the offsets with 2 or 3 INT ADDs and 2 or 3 MULs.
So each iteration I must do many operations before I can even load the two data...
With outside precalculation of the offsets (as I said before), we can reduce this to a few operations, plus the load and the FMAC for each iteration... one index increment, a memory access (with the precalculated index) and an FMAC. 7 operations per iteration are reached in other code, not posted, which does more complex calculations than a single FMAC. But if you keep the code simple, you need at least 3 ADDs and 3 MULs just to calculate the offsets, and only then can you load the two data...

Only with one-dimensional data do you need just 2 indices, and so 3 arithmetic instructions (1 FMAC) and 2 memory accesses...

EDIT: obviously I forgot the 6 CMPs and the 6 conditional jumps... They too, in turn, are executed... 1 fused uop per iteration, plus a few more once in a while (when an inner loop finishes)
In fact your code actually has just one "hot spot" for memory reads, only in the innermost loop, while the rest of the time it does a lot of integer/scalar operations:
Code:
                            for (x = -d_x; x <= d_x; x++)
00007FF7B640123B 45 3B C3             cmp         r8d,r11d 
00007FF7B640123E 7F 2D                jg          MeanVar+26Dh (07FF7B640126Dh) 
00007FF7B6401240 42 8D 04 0B          lea         eax,[rbx+r9] 
00007FF7B6401244 03 C1                add         eax,ecx 
00007FF7B6401246 0F AF C7             imul        eax,edi 
00007FF7B6401249 41 03 C0             add         eax,r8d 
00007FF7B640124C 03 C2                add         eax,edx 
00007FF7B640124E 48 98                cdqe 
00007FF7B6401250 49 8D 0C C4          lea         rcx,[r12+rax*8] 
00007FF7B6401254 41 8B C3             mov         eax,r11d 
00007FF7B6401257 41 2B C0             sub         eax,r8d 
00007FF7B640125A FF C0                inc         eax 
00007FF7B640125C 48 63 D0             movsxd      rdx,eax 
                                val += img[x + i + sx*(y + j + sy*(z + k + sz*m))];
00007FF7B640125F F2 0F 58 01          addsd       xmm0,mmword ptr [rcx] 
00007FF7B6401263 48 83 C1 08          add         rcx,8 
00007FF7B6401267 48 83 EA 01          sub         rdx,1 
00007FF7B640126B 75 F2                jne         MeanVar+25Fh (07FF7B640125Fh) 
                        for (y = -d_y; y <= d_y; y++)
00007FF7B640126D 8B 8C 24 B8 00 00 00 mov         ecx,dword ptr [j] 
00007FF7B6401274 41 FF C1             inc         r9d 
00007FF7B6401277 8B 94 24 C0 00 00 00 mov         edx,dword ptr [i] 
00007FF7B640127E 45 3B CE             cmp         r9d,r14d 
00007FF7B6401281 0F 8E 0B FF FF FF    jle         MeanVar+192h (07FF7B6401192h)
Well, the code generated by Visual Studio 2015 doesn't seem to be very well optimized, since it's not even able to vectorize the code (it uses only scalar double FP operations). However, it did a good job of recognizing the invariant parts and pre-calculating some values.

One thing that you can easily verify is that in this part of the code the stack variables are rarely used compared to all the memory accesses, and the same happens in the rest of the code.
 

cdimauro

Member
Sep 14, 2016
163
14
61
He's referring (not "red herring" ) to the right side citing the units while you point at the lower left showing the capabilities of the queues + L1D$.
Correct.
Not at all, he's referring to the capability per cycle; this has already been pointed out by a member who told him that:

To which he answered:

Your denial is in the quotes I made above, all of which can be found here, in one of your posts:

http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/first-summit-ridge-zen-benchmarks.2482739/page-56#post-38509839
There's no denial. See below.
As for discussing the LSU capabilities so much, are you sure that it's not rather about ignoring its capabilities..?
Well, you continue to ignore the point of the discussion, which should have been clear from the post that you quoted.

The example I gave with the three instructions is self-explanatory, as is the rest of what I wrote.

If you aren't able to understand it, then that's your problem. Otherwise you can easily answer the question that I asked.
 
Last edited:

cdimauro

Member
Sep 14, 2016
163
14
61
I think one point is the amount of local variables/structs used, as these would also land on the stack, plus the contents of each register that needs to be kept for later. The ABI and twice the registers in 64-bit help, of course.
Absolutely. Here is the x86 version of the above code:
Code:
                            for (x = -d_x; x <= d_x; x++)
0007117E 3B CF                cmp         ecx,edi
00071180 7F 2A                jg          MeanVar+1ACh (0711ACh)
00071182 8D 04 32             lea         eax,[edx+esi]
00071185 8B 55 E0             mov         edx,dword ptr [img]
00071188 03 45 EC             add         eax,dword ptr [j]
0007118B 0F AF 45 10          imul        eax,dword ptr [sx]
0007118F 03 C1                add         eax,ecx
00071191 03 45 F8             add         eax,dword ptr [i]
00071194 8D 14 C2             lea         edx,[edx+eax*8]
00071197 8B C7                mov         eax,edi
00071199 2B C1                sub         eax,ecx
0007119B 40                   inc         eax
0007119C 0F 1F 40 00          nop         dword ptr [eax]
                                val += img[x + i + sx*(y + j + sy*(z + k + sz*m))];
000711A0 F2 0F 58 0A          addsd       xmm1,mmword ptr [edx]
000711A4 83 C2 08             add         edx,8
000711A7 83 E8 01             sub         eax,1
000711AA 75 F4                jne         MeanVar+1A0h (0711A0h)
                        for (y = -d_y; y <= d_y; y++)
000711AC 8B 4D 08             mov         ecx,dword ptr [d_y]
000711AF 46                   inc         esi
000711B0 8B 55 F4             mov         edx,dword ptr [ebp-0Ch]
000711B3 3B F1                cmp         esi,ecx
000711B5 0F 8E 65 FF FF FF    jle         MeanVar+120h (071120h)
and it's quite evident that the stack is used a lot, albeit not in the innermost loop.
One thing to add about the stack engine: Mike explicitly talked about early store-to-load forwarding, as stack ops usually go PUSH -> POP, not the load first. So does this have to go to the cache? Only if some dependency is detected, which is also a job of the stack engine.
IMO it should go to the cache too. Having two different mechanisms to keep track of memory disambiguation, one for the stack and one in the L/S section, is overkill.
Compatibility with the ISA should be guaranteed. Stacks are private to each thread and usually are never accessed by other CPUs or DMA devices, but, even if there is no rush, I think these data should ultimately be written to RAM, or at least to the caches...
The stack can also be shared between threads/cores, although it's a bit tricky.
EDIT: the stack engine must propagate the data in one specific case: if there are instructions in the flow that are not managed by the stack engine, and so not removed from it, which use the stack data via direct addressing, such as MOV AX,[SP+BP] etc...

EDIT: I would also like to know how the aliasing problem is resolved: I can access the stack through another register, or with another offset... How is this managed?
Eh. That's the memory disambiguation problem, which is solved by comparing the resulting (virtual) pointer values.
So the stack engine should examine ALL memory instructions to see if they access the stack? Should it do all the AGU calculations? What about indirect addressing? This can be very slow...
No, I think that the stack engine can safely track only [RSP + offset] and [ESP + offset] memory accesses. EBP/RBP can also be used with non-stack data pointers, or even for holding plain data, so it's a bit difficult to use the same logic/methods there. Which is OK, because with 64-bit code it's RSP that is usually used for referencing stack data.
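A minimal sketch of that distinction (x86-64; which forms the stack engine actually tracks is my assumption here, not something AMD has documented):
Code:
push rbx                  ; RSP-implicit: trackable by the stack engine
mov  rax, [rsp+16]        ; RSP + offset: trackable
pop  rcx                  ; RSP-implicit: trackable
mov  rax, [rbp-8]         ; RBP-relative: may or may not point into the stack frame
mov  rax, [rbx+rdx*8]     ; arbitrary pointer: normal AGU/LSU path, no stack-engine tracking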
I think that checkpointing was introduced for this reason... If at the retire stage I detect a conflict between a memory operation and a stack operation on the same physical address, this has to be handled...
Yes, it could be. But you still need to propagate the writes to the L/S section, and to update the L1D too (when it's needed / forced).
Let's make an example: suppose we PUSH (write) something onto the stack, and a subsequent instruction with a complex addressing mode not using SP, trying to load the same address, is directed to the normal path: if the L/S units are not connected to the memfile, the instruction could see a stale value... So the memfile acts as an L0D cache... Obviously the opposite also holds: the memfile should also check the L1D cache, with an already-calculated address, without using the standard AGUs...
Exactly: that's a hazard.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
That's enough, for small pieces of code.

The analysis has a lot of limitations, if you read the paper.

So, fewer AGUs are better? If they are rarely used, there's no need to put in 3 AGUs. And not even 2 load + 1 store units. Right? So you can save transistors and power, and have a simpler design, right?

In fact your code actually has just one "hot spot" for memory reads, only in the innermost loop, while the rest of the time it does a lot of integer/scalar operations:
Code:
                            for (x = -d_x; x <= d_x; x++)
00007FF7B640123B 45 3B C3             cmp         r8d,r11d
00007FF7B640123E 7F 2D                jg          MeanVar+26Dh (07FF7B640126Dh)
00007FF7B6401240 42 8D 04 0B          lea         eax,[rbx+r9]
00007FF7B6401244 03 C1                add         eax,ecx
00007FF7B6401246 0F AF C7             imul        eax,edi
00007FF7B6401249 41 03 C0             add         eax,r8d
00007FF7B640124C 03 C2                add         eax,edx
00007FF7B640124E 48 98                cdqe
00007FF7B6401250 49 8D 0C C4          lea         rcx,[r12+rax*8]
00007FF7B6401254 41 8B C3             mov         eax,r11d
00007FF7B6401257 41 2B C0             sub         eax,r8d
00007FF7B640125A FF C0                inc         eax
00007FF7B640125C 48 63 D0             movsxd      rdx,eax
                                val += img[x + i + sx*(y + j + sy*(z + k + sz*m))];
00007FF7B640125F F2 0F 58 01          addsd       xmm0,mmword ptr [rcx]
00007FF7B6401263 48 83 C1 08          add         rcx,8
00007FF7B6401267 48 83 EA 01          sub         rdx,1
00007FF7B640126B 75 F2                jne         MeanVar+25Fh (07FF7B640125Fh)
                        for (y = -d_y; y <= d_y; y++)
00007FF7B640126D 8B 8C 24 B8 00 00 00 mov         ecx,dword ptr [j]
00007FF7B6401274 41 FF C1             inc         r9d
00007FF7B6401277 8B 94 24 C0 00 00 00 mov         edx,dword ptr [i]
00007FF7B640127E 45 3B CE             cmp         r9d,r14d
00007FF7B6401281 0F 8E 0B FF FF FF    jle         MeanVar+192h (07FF7B6401192h)
Well, the code generated by Visual Studio 2015 doesn't seem to be very well optimized, since it's not even able to vectorize the code (it uses only scalar double FP operations). However, it did a good job of recognizing the invariant parts and pre-calculating some values.

One thing that you can easily verify is that in this part of the code the stack variables are rarely used compared to all the memory accesses, and the same happens in the rest of the code.

OK, there is no stack use, but as you can see the compiler managed to do only one true memory access per iteration (addsd xmm0, mmword ptr [rcx])...

So only one strict AGU use per iteration.

There are two LEAs, but:

1) they are intermixed with 8-10 other integer instructions, even an IMUL, which is quite complex, so the AGUs should not be a limit, and
2) we cannot assume they will use an AGU: if I remember correctly, LEA is also executed in the ALUs on AMD architectures...

So I don't see a limit even for two threads...

2 AGUs should let the CPU process this code at a few clock cycles per loop iteration... Anyway, this code is awful... Right after the IMUL there is an instruction that depends on it... Lots of dependencies...


But for obvious reasons, I gave you the code of only a simple ancillary function. The main filter, as you can see if you follow the link to the paper, has complicated calculations in the inner loop, also involving transcendental functions (exp), squaring, multiplications, divisions and so on... There are six loops. In the third loop one of the source data is read, and in the innermost loop only one datum per iteration is read, on which complicated calculations are performed, with at least six FP instructions, one of which is a division and one an exp... In this case the AGUs should not be a bottleneck.

The worst case would be a convolution filter... But in that case you are limited by the RAM bandwidth...
 

cdimauro

Member
Sep 14, 2016
163
14
61
I'm sorry for not being clear before: I did the analysis of your code only to show how low the stack usage is in this case (with x64 in particular, which has plenty of available registers).
Naturally, it's not limited by the lack of AGUs, and BTW I don't think that memory bandwidth should be a bottleneck either, since the code isn't using vector instructions at all. And a stack engine, whatever work it does, doesn't help here either.
That's it.
 

dark zero

Platinum Member
Jun 2, 2015
2,655
138
106
Wow, this is interesting. So what is the estimate of Zen's performance?
If it gets to Haswell levels on ST, I'll go with AMD, not for performance but to keep a competitor alive (just like the people who watched the Warcraft movie).
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I'm sorry for not being clear before: I did the analysis of your code only to show how low the stack usage is in this case (with x64 in particular, which has plenty of available registers).
Naturally, it's not limited by the lack of AGUs, and BTW I don't think that memory bandwidth should be a bottleneck either, since the code isn't using vector instructions at all. And a stack engine, whatever work it does, doesn't help here either.
That's it.
For both of you I have an interesting link, providing traces up to 100M instructions long, incl. a small player, which can be used for counting specific uops:
http://www.cis.upenn.edu/~milom/cis501-Fall12/traces/trace-format.html

I once had a look at the small gcc trace. It contains (for 1000 instructions executed): 232 loads (incl. 35 POPs -> 13%) and 108 stores (incl. 29 PUSHs -> 32%).
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,868
3,419
136
Wow, this is interesting. So what is the estimate of Zen's performance?
If it gets to Haswell levels on ST, I'll go with AMD, not for performance but to keep a competitor alive (just like the people who watched the Warcraft movie).

None of us has any real clue about performance (some people's guesses might be right, but it's not much more than flipping a coin). My approach has been to look for upper/lower bounds, things that could limit ST performance.

Right now there isn't really anything we can find that says Zen can't have just as high ST perf as Broadwell; Skylake and beyond have bigger internal structures, so they should get more ILP, all other things being equal.

What we have no idea about is the prefetchers/predictors, but this is one area the CON/CAT cores were actually pretty good at. I would expect an evolution of these units from those cores, so they are likely behind Intel in that regard, but probably not massively. In some cases Intel have actually turned down the aggressiveness of their prefetch/predict logic because they could gain more from the power saved (thus higher clocks) than the performance lost to the extra cache misses.

There are a few "interesting points" that we can see with Zen for ST:
Do the 6 small integer schedulers have an impact on extractable ILP vs one big one, and if so, how much?
What's the mispredict penalty now, with both a "checkpoint unit" AMD said almost nothing about and with the uop cache?

With SMT the interesting one for me is the hard-partitioned store queue; I've often wondered how late (aggressive) Intel are in terms of writing data out to cache. There is a lot of power to be saved if you can store-to-load forward it, compared to writing it to cache and then reading it again.

A lot more of the questions are actually around SIMD performance; this whole stack engine discussion is really about having enough load/store and address generation for very heavy, sustained amounts of 3-operand operations (AVX, FMA).
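A minimal sketch of that kind of pressure, assuming a daxpy-style kernel (y[i] += a*x[i], AVX2/FMA3; the register assignments are hypothetical and ymm0 is assumed pre-loaded with the broadcast scalar a):
Code:
loop:
vmovupd      ymm1, [rsi+rax]          ; load 4 doubles of x
vfmadd213pd  ymm1, ymm0, [rdi+rax]    ; ymm1 = a*x + y, second load folded into the FMA
vmovupd      [rdi+rax], ymm1          ; store 4 doubles of y
add          rax, 32
cmp          rax, rcx
jne          loop                     ; 2 loads + 1 store per FMA: address generation and the
                                      ; L1D ports, not the FMA units, set the sustained rate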

On top of this, a Zen core "only" has an IVB amount of load and store in and out of the core, and only 128-bit-wide FP units. For SSE and x64/x86 the 2 AGUs against 4 ALUs should almost never be an issue; Intel only added the 3rd AGU when they went to 256-bit units and data paths, so that's a hint as well. Per clock, for 128-bit SIMD ops Bulldozer/Piledriver were approximately a match (module vs core); for 256-bit ops Bulldozer/Piledriver had some penalties that lowered their performance. So when we look at Zen, it has quite a large number of changes to FP SIMD handling, which should help boost its performance above CON-core level, and hopefully to IVB level for 256-bit ops. For 128-bit ops Zen could very well be ahead of all Intel chips; we will have to wait and see. The improvements we know Zen has over the CON cores are:

Larger FP register file
More execution resources
Lower execution latency (except for FMA)
Lower load-to-use latency
A buffer before the FPU scheduler, so ops can still be sent to the FPU even if the FPU scheduler is full

What we don't know is whether Zen has improved 256-bit handling, but it's a reasonably safe assumption that it has, as these issues were internal FPU issues, not core-wide things.

Now, clocks are anyone's guess; there is Fmax for the core on the process it's on, Fxmax for the core on an ideal process (not saying LPP isn't an "ideal" process), and clock/power scaling.

Now, personally for me there have only been two people whose words I would value in terms of Zen performance: one is Neilz at B3D, who works for an OEM, and the other is Thevenin, who also works for an OEM. Both have said quite positive things about Zen without being specific (Neilz was commenting on the 32-core part; Thevenin appeared to be commenting on the 8-core part).
 
Reactions: .vodka

cdimauro

Member
Sep 14, 2016
163
14
61
For both of you I have an interesting link, providing up to 100m long traces incl. a small player, which can be used for counting specific uops:
http://www.cis.upenn.edu/~milom/cis501-Fall12/traces/trace-format.html

I once had a look at the small gcc trace. It contains (for 1000 instructions executed): 232 loads (incl. 35 POPs -> 13%) and 108 stores (incl. 29 PUSHs -> 32%).
Thanks. I took a look at the same trace, but I have something to say about the binary code and the tool itself.

I suppose that trace comes from code compiled by GCC itself; however, it doesn't look very optimized. I saw some word-size (16-bit) operand usage (in 64-bit code!), which is usually slow, and in general a more x86-like approach to function calls, with pushes and pops. In fact, I don't see good usage of the available registers.

I've written a small Python script which generates some stats, and here are the ones about the register usage:
Code:
Reg: Count
  0: 211
  1:  67
  2: 118
  3:  68
  4: 299
  5:  87
  6:  61
  7:  50
  8:  23
  9:  18
 10:  10
 11:  10
 12:  51
 13:  57
 14:  23
 15:  14
 44: 319
 45: 140
It might also be that this is related to the fact that those are the first 1000 trace records (uops, not instructions), so there's some setup code. It'll be more interesting to see traces of compute-intensive tasks.

Anyway, here are also some stats about instructions:
Code:
Instruction           Count    %
MOV                     227  30%
J                       157  20%
CMP                     111  14%
TEST                     48   6%
PUSH                     35   4%
POP                      29   3%
ADD                      22   2%
JMP                      17   2%
MOVZX                    15   1%
LEA                      13   1%
XOR                      11   1%
CALL                     10   1%
RET                       9   1%
SHL                       9   1%
AND                       9   1%
SUB                       8   1%
NOP                       8   1%
MOVSX                     5   0%
SHR                       4   0%
SET                       3   0%
XCHG                      2   0%
NOT                       1   0%
OR                        1   0%
TOTAL                   754 100%
So, PUSH and POP account for 4% and 3% of the total, respectively.

Finally, regarding the micro-ops, this is the situation:
Code:
Micro-op              Count    %
LOAD                    232  23%
JMP_IMM                 183  18%
SUB_IMM                 113  11%
STORE                   108  10%
ADD_IMM                  67   6%
SAVE_PC                  58   5%
ADD                      58   5%
SUB                      51   5%
AND                      49   4%
LEA                      16   1%
JMP_REG                  10   1%
NOP                      10   1%
ZEXT_BYTE_TO_DWORD        8   0%
AND_IMM                   8   0%
ZEXT_WORD_TO_DWORD        7   0%
SHL_IMM                   7   0%
SEXT_DWORD_TO_QWORD       5   0%
SHR_IMM                   4   0%
XOR                       2   0%
SHL                       2   0%
NOT                       1   0%
OR                        1   0%
TOTAL                  1000 100%
Regarding the tool, it's clearly outdated (the last update was in 2011), and I think that it TRIES to simulate the behavior of an Intel processor (maybe Sandy Bridge).

For this reason, the data about uops and micro-ops are at least biased.

I also found some errors, like this:
Code:
1 48becc -1 -1 45 - - -   5887758            0       48bed2            0 CMP      SAVE_PC
2 48becc -1 45 45 - - L         0       a295e0       48bed2            0 CMP      LOAD
3 48becc  5 45 44 W - -         0            0       48bed2            0 CMP      SUB
which doesn't make sense for a CMP instruction. Specifically, CMP needs at most 2 uops to be executed on Sandy Bridge (according to Agner).

A similar error is reported for the CALL:
Code:
1 48d248 -1 -1 44 - - -         0            0       48d24d            0 CALL     SAVE_PC
2 48d248  4 -1  4 - - -         8            0       48d24d            0 CALL     SUB_IMM
3 48d248 44  4 -1 - - S         0  7fffe7fefc8       48d24d            0 CALL     STORE
4 48d248 -1 -1 -1 - T -     -5149            0       48d24d       48be30 CALL     JMP_IMM
A near call needs 3 uops, not 4. 4 are required only for call using a memory operand.

Finally, a text format might be good for small traces, but for big ones I greatly prefer a more compact binary format, which is also much easier to parse in parallel, using all available cores/threads (which is required for millions of traces).
 
Reactions: Dresdenboy

cdimauro

Member
Sep 14, 2016
163
14
61
There are a few "interesting points" that we can see with Zen for ST:
Do the 6 small integer schedulers have an impact on extractable ILP vs one big one, and if so, how much?
IMO a big scheduler is able to extract more ILP, because it can "instantly" have a view of the situation with the ports & dependencies & stalls, and take the proper decisions.

I have difficulty imagining how a system with multiple queues/schedulers can deal with the same things while keeping a high ILP.

But that's a limit of mine.
A lot more of the questions are actually around SIMD performance; this whole stack engine discussion is really about having enough load/store and address generation for very heavy, sustained amounts of 3-operand operations (AVX, FMA).
On top of this, add that 256-bit instructions might basically "halve" the throughput of the decoder, since 2 macro-ops are generated. That's if Zen follows the same strategy as Steamroller.
Larger FP register file
More execution resources
Lower execution latency (except for FMA)
Lower load-to-use latency
A buffer before the FPU scheduler, so ops can still be sent to the FPU even if the FPU scheduler is full
A couple of things here. First, FPU instructions with memory operands need to pass through the "INT/AGU" section, and we don't know if/how much this hurts performance. Second, there are FPU instructions which use "integer" registers, so the execution has to be split in some way, with a corresponding penalty.
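Two concrete instances of these cases (x86-64; the instruction choice here is just illustrative, not taken from any Zen documentation):
Code:
vaddpd    ymm0, ymm1, [rsi+rax*8]   ; FP op with a memory operand: the load/AGU part has to
                                    ; be issued on the INT/AGU side, the add on the FP side
cvttsd2si eax, xmm0                 ; FP source, integer destination: the result has to cross
                                    ; from the FP unit back to the integer register file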
 
Last edited:

bjt2

Senior member
Sep 11, 2016
784
180
86
The terms macro-op and micro-op make me think that it is an AMD CPU, maybe pre-Bulldozer...
I always thought that Intel had only simple uops...
 

cdimauro

Member
Sep 14, 2016
163
14
61
Bulldozer arrived in October 2011, whereas the last update to the page was made in September 2011.

EDIT: taking a look at K10 & Bulldozer, there's nothing for CMP & CALL instructions which resembles what was reported in the traces.
 
Last edited:

Nothingness

Platinum Member
Jul 3, 2013
2,769
1,429
136
IMO a big scheduler is able to extract more ILP, because it can "instantly" have a view of the situation with the ports & dependencies & stalls, and take the proper decisions.

I have difficulty imagining how a system with multiple queues/schedulers can deal with the same things while keeping a high ILP.
The problem is that a unified scheduler has huge complexity. Even simple heuristics, such as picking the oldest instructions, are hard to implement with low latency. I wouldn't be surprised if Intel's scheduler were the most complex and tuned block of their CPUs.

I'm not convinced the difference versus split schedulers is big enough to justify the huge investment a full-custom block requires.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,868
3,419
136
I have difficulty imagining how a system with multiple queues/schedulers can deal with the same things while keeping a high ILP.
I would assume there is a scheduler of schedulers that has a high-level view of what each sub-scheduler's queue looks like. So I really have no idea how much that will hurt ILP, if at all. A question I just thought of: would the scheduler of schedulers be FIFO, or something more complex?

On top of this, add that 256-bit instructions might basically "halve" the throughput of the decoder, since 2 macro-ops are generated. That's if Zen follows the same strategy as Steamroller.
I'm sure there was a link somewhere (I think posted by Dresdenboy) where Michael Clark said that it leaves decode as one Mop and is split later on, so to me that sounds more like 1 decoder. Intel probably shared the AVX spec with AMD too late for them to accommodate it in this way in the CON cores; you would need to accommodate this both in the schema/format of the Mops and also in decode/issue/etc.

A couple of things here. First, FPU instructions with memory operands need to pass through the "INT/AGU" section, and we don't know if/how much this hurts performance. Second, there are FPU instructions which use "integer" registers, so the execution has to be split in some way, with a corresponding penalty.
This is the same for all AMD uarchs, except it was worse for Bulldozer, as it had that 2-core/FPU LSU interconnect thing going on.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Finally found some time...

Thanks. I took a look at the same trace, but I have something to say about the binary code and the tool itself.

I suppose that trace comes from code compiled by GCC itself; however, it doesn't look very optimized. I saw some word-size (16-bit) operand usage (in 64-bit code!), which is usually slow, and in general a more x86-like approach to function calls, with pushes and pops. In fact, I don't see good usage of the available registers.
Thanks for the statistics (also on your blog). These are always interesting and help in discussing uarch topics.

The implications of a specific compiler and its settings create some bias, but there shouldn't be too much of a difference between this code and that of a newer GCC variant in the number of loops and stack/mem ops, as the source code, the available architectural registers and the 64-bit ABI are still the same.

Regarding the tool, it's clearly outdated (the last update was in 2011), and I think that it TRIES to simulate the behavior of an Intel processor (maybe Sandy Bridge).

For this reason, the data about uops and micro-ops are at least biased.

I also found some errors, like this:
Code:
1 48becc -1 -1 45 - - -  5887758  0  48bed2  0 CMP  SAVE_PC
2 48becc -1 45 45 - - L  0  a295e0  48bed2  0 CMP  LOAD
3 48becc  5 45 44 W - -  0  0  48bed2  0 CMP  SUB
which doesn't make sense for a CMP instruction. Specifically, CMP needs at most 2 uops to be executed on Sandy Bridge (according to Agner).

A similar error is reported for the CALL:
Code:
1 48d248 -1 -1 44 - - -  0  0  48d24d  0 CALL  SAVE_PC
2 48d248  4 -1  4 - - -  8  0  48d24d  0 CALL  SUB_IMM
3 48d248 44  4 -1 - - S  0  7fffe7fefc8  48d24d  0 CALL  STORE
4 48d248 -1 -1 -1 - T -  -5149  0  48d24d  48be30 CALL  JMP_IMM
A near call needs 3 uops, not 4. 4 are required only for call using a memory operand.

Finally, a text format might be good for small traces, but for big ones I greatly prefer a more compact binary format, which is also much easier to parse in parallel, using all available cores/threads (which is required for millions of traces).

There is a more detailed description of the code and the simulator in Andrew Hilton's dissertation:
Energy Efficient Load Latency Tolerance: Single-thread Performance For the Multi-Core Era
http://repository.upenn.edu/cgi/viewcontent.cgi?article=1311&context=edissertations

Andrew Hilton said:
The simulator executes the user-level portions of statically linked 64-bit x86 programs. It decodes x86 macro instructions and cracks them into a RISC-style μop ISA. The μop cracking used in Core i7 is proprietary, so there is no way to tell how similar the μop ISAs are.

IMO a big scheduler is able to extract more ILP, because it can "instantly" have a view of the situation with the ports & dependencies & stalls, and take the proper decisions.

I have difficulty imagining how a system with multiple queues/schedulers can deal with the same things while keeping a high ILP.

But that's a limit of mine.

On top of this, add that 256-bit instructions might basically "halve" the throughput of the decoder, since 2 macro-ops are generated. That's if Zen follows the same strategy as Steamroller.

A couple of things here. First, FPU instructions with memory operands need to pass through the "INT/AGU" section, and we don't know if/how much this hurts performance. Second, there are FPU instructions which use "integer" registers, so the execution has to be split in some way, with a corresponding penalty.
There are patents which mention a method of putting uops, which depend on the output of older uops, into the same scheduler queue as the older ones. These could then be executed in order, as a small dependency chain, created by a small part of the code DAG. Such a scheduler doesn't need to track all other schedulers, as they all might broadcast the availability and source (PRF, bypass network, imm/disp storage, flag PRF) of results via tags.

Other patents talk about "load-operations", which seem to be a single uop, which gets issued two times. But this only works with unified schedulers, as separate ones can't exchange uops. Especially with FPU Ex + AGU ops this is not possible.

But based on the different patents' filing dates Zen+ might get a unified int scheduler.

The terms macro-op and micro-op make me think that it is an AMD CPU, maybe pre-Bulldozer...
I always thought that Intel had only simple uops...
You might also see the term "macro-op" used to stand for the CISC op. Indeed, these terms have been used interchangeably in a way that makes it difficult to tell the differences. For the source of the uops, see my answer to cdimauro above.

I would assume there is a scheduler of schedulers that has a high-level view of what each sub-scheduler's queue looks like. So I really have no idea how much that will hurt ILP, if at all. A question I just thought of: would the scheduler of schedulers be FIFO, or something more complex?
I think the job you described here is that of the renamer. The individual schedulers still check the available sources in their arrays, which get updated by broadcast tags.

I'm sure there was a link somewhere (I think posted by Dresdenboy) where Michael Clark said that it leaves decode as one Mop and is split later on, so to me that sounds more like 1 decoder. Intel probably shared the AVX spec with AMD too late for them to accommodate it in this way in the CON cores; you would need to accommodate this both in the schema/format of the Mops and also in decode/issue/etc.
Here's the link. I cited him over at S|A.
http://semiaccurate.com/forums/showpost.php?p=273694&postcount=20
 

Zstream

Diamond Member
Oct 24, 2005
3,396
277
136
Can we talk about architecture in the other thread? I'm looking for benchmarks and see very little :*(
 