Criteria for determining CPU issue width

Carfax83 · Jun 30, 2019

Can someone who is informed on these matters tell me what criteria are used to determine CPU issue width, or how many instructions a CPU can issue at once? From some light reading, I thought it referred to the number of decoders, which would make Zen 2 a 4-issue-wide CPU. But then I read a reddit post by looncraz where he stated that Zen 2 was 8-issue wide, which surprised me.

Looncraz doesn't strike me as the sort to not know what he's talking about, so more than likely I'm the one who's incorrect.

Vattila · Jun 30, 2019

Interesting question. I'm not an expert and cannot right now be bothered to do any research on the correct terminology, but if you are interested in "issue width" to gauge how many instructions a core is able to handle per clock tick, it is obvious that the decode width is not a good measure, for the reason that modern x86 cores buffer decoded instructions in a micro-op cache. So you can have small code sequences, in particular tight loops (such as memory copies, simple SIMD operations on large data streams) executing straight from the micro-op cache without any pressure on the decoders at all.

Another view of the width of the core is the number of execution units that can be engaged in the same clock cycle. But in the end, I suppose the better measure of execution width is how many instructions the core can retire (i.e. complete) per clock cycle. It will have a theoretical maximum number, but in practice a lower actual number in real code, due to stalls on memory accesses and contention for resources.

In any case, beware of putting too much significance on the theoretical optimal numbers. There are performance counters in modern processors to measure these things on real code sequences.

inf64 · Jun 30, 2019

Look here:
https://www.agner.org/optimize/microarchitecture.pdf

For Zen:
The maximum throughput of six μops per clock cycle can be obtained if at least half of the instructions generate two μops each
Later on : Eight μops can retire in one clock cycle.

^^ This is for Zen1(+). For Zen2 I guess 8 might be the new limit for issue. Retire width was also increased but AMD did not state by how much. Zen1 was already a wider core than Skylake, Zen2 is even wider. Icelake might bring Intel back to parity there.

Edit:
From page 221:
Bottlenecks in AMD Ryzen
The throughput of each core in the Ryzen is higher than on any previous AMD or Intel x86 processor, except for 256-bit vector instructions. Loops that fit into the μop cache can have a throughput of five instructions or six μops per clock cycle. Code that does not fit into the μop cache can have a throughput of four instructions or six μops or approximately 16 bytes of code per clock cycle, whichever is smaller. The 16 bytes fetch rate is a likely bottleneck for CPU intensive code with large loops. Most instructions are supported by two, three, or four execution units so that it is possible to execute multiple instructions of the same kind simultaneously. Instruction latencies are generally low. The 256-bit vector instructions are split into two μops each. A piece of code that contains many 256-bit vector instructions may therefore be limited by the number of execution units. The maximum throughput here is four vector-μops per clock cycle, or six μops per clock cycle in total if at least a third of the μops use general purpose registers.
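
For anyone who wants to play with those "whichever is smaller" limits, here is a rough sketch of how they combine. It only uses the Zen 1 figures quoted above (5 instructions or 6 μops per clock from the μop cache; 4 instructions, 6 μops or about 16 bytes per clock otherwise). The function name and its parameters are made up for illustration, and it ignores everything downstream of the front end (execution units, memory stalls, retirement), so treat the result as an upper bound at best.

```python
def zen1_frontend_ipc(avg_uops_per_instr, avg_bytes_per_instr, from_uop_cache):
    """Rough upper bound on instructions per clock from the front end alone.

    Uses only the Zen 1 limits quoted from Agner Fog's manual in this post.
    The function and parameter names are hypothetical.
    """
    # 6 uops per clock, converted to instructions per clock.
    uop_limit = 6.0 / avg_uops_per_instr
    if from_uop_cache:
        # Loops that fit the uop cache: 5 instructions or 6 uops per clock.
        return min(5.0, uop_limit)
    # Otherwise: 4 instructions, 6 uops, or ~16 bytes of code per clock.
    byte_limit = 16.0 / avg_bytes_per_instr
    return min(4.0, uop_limit, byte_limit)


# Tight loop of single-uop, 4-byte instructions running from the uop cache.
print(zen1_frontend_ipc(1.0, 4.0, from_uop_cache=True))    # 5.0
# Large loop of 6-byte instructions: the 16-byte fetch rate is the limit.
print(zen1_frontend_ipc(1.0, 6.0, from_uop_cache=False))   # about 2.67
# Two uops per instruction: the 6-uop limit caps it at 3 instructions.
print(zen1_frontend_ipc(2.0, 4.0, from_uop_cache=False))   # 3.0
```

Note how the answer to "how wide is it?" changes with the code mix, which is why a single issue-width number is hard to pin down.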

Carfax83 · Jul 1, 2019

Vattila said:
Interesting question. I'm not an expert and cannot right now be bothered to do any research on the correct terminology, but if you are interested in "issue width" to gauge how many instructions a core is able to handle per clock tick, it is obvious that the decode width is not a good measure, for the reason that modern x86 cores buffer decoded instructions in a micro-op cache. So you can have small code sequences, in particular tight loops (such as memory copies, simple SIMD operations on large data streams) executing straight from the micro-op cache without any pressure on the decoders at all.

Yes, this explains it well, thanks. When looncraz was referring to Zen 2 being 8-issue wide, I now believe he must have been referring to the buffered decoded instructions in the micro-op cache.

Carfax83 · Jul 1, 2019

inf64 said:
Look here:
https://www.agner.org/optimize/microarchitecture.pdf

Cripes, the amount of work that must have gone into that document is staggering!

For Zen:
The maximum throughput of six μops per clock cycle can be obtained if at least half of the instructions generate two μops each
Later on : Eight μops can retire in one clock cycle.

^^ This is for Zen1(+). For Zen2 I guess 8 might be the new limit for issue. Retire width was also increased but AMD did not state by how much. Zen1 was already a wider core than Skylake, Zen2 is even wider. Icelake might bring Intel back to parity there.

Hey thanks for the info man. This is the thread on reddit where I saw that comment from looncraz:

https://www.reddit.com/r/Amd/comments/bm9ti0/_/emv3rpo

cherullo · Jul 1, 2019

This debate rages on exactly because no single number can characterize the whole pipeline, so it really boils down to what you think is actually important.

Talking about Zen1, if your code is coming from the instruction cache, passing through the decoders and not hitting any pipeline stalls, then it'll be limited to 4 instructions per clock.
If your code is coming from the uop cache and not hitting any pipeline stalls then it'll be limited to 6 uops per clock (notice that this may represent fewer than 6 instructions).

If your code has just the right mix of instructions and has for some reason stalled waiting for operands (which may arrive either from memory or as results of other instructions) then in the best case you'll be able to issue 10 uops per clock (4 ALUs, 2 AGUs, 4 FPUs - the reservation stations seem to be mostly independent). Notice that in this scenario the CPU is working hard to compensate for the stall; it cannot sustain this issue rate forever.

If a number of instructions in your code have been executed but are waiting in the ROB to be retired (since they must be retired in order, a single slow instruction can hold up the retirement of hundreds of others), then the retirement limit is 8 instructions per clock. Once again, the CPU is working to compensate for a previous stall; it cannot sustain this retirement rate indefinitely.

I don't know why looncraz said that Zen2 is 8 issue. Mike Clark himself says it's 7-issue on the integer side alone:

itsmydamnation · Jul 1, 2019

cherullo said:
I don't know why looncraz said that Zen2 is 8 issue. Mike Clark himself says it's 7-issue on the integer side alone:

Because it is. The 7-issue figure in that slide is talking about integer execution, not the total number of uops that can be dispatched to the CPU execution stage (int + fp) in one cycle.

moinmoin · Jul 1, 2019

Carfax83 said:
https://www.reddit.com/r/Amd/comments/bm9ti0/_/emv3rpo

The Zen 3 being SMT4 rumor has been going around for ages now, and looncraz' argument of Zen 2 essentially being wide enough for that, aside from further polishing, is a solid one. Looks like Zen 3 won't add more cores like Zen and Zen 2 did, instead it will just casually double the number of threads.

TheELF · Jul 1, 2019

moinmoin said:
The Zen 3 being SMT4 rumor has been going around for ages now, and looncraz' argument of Zen 2 essentially being wide enough for that, aside from further polishing, is a solid one. Looks like Zen 3 won't add more cores like Zen and Zen 2 did, instead it will just casually double the number of threads.

How do you figure?
SMT 4 would drop performance of each thread to 1/4 of what it is now, or 1/2 whichever way you see it, and that is with perfect software and perfect hardware implementation; the real world would be a disaster.
You would be able to handle more threads at once but only if they used far fewer instructions each.

Thunder 57 · Jul 1, 2019

TheELF said:
How do you figure?
SMT 4 would drop performance of each thread to 1/4 of what it is now, or 1/2 whichever way you see it, and that is with perfect software and perfect hardware implementation; the real world would be a disaster.
You would be able to handle more threads at once but only if they used far fewer instructions each.

POWER can see benefits from SMT 4 and SMT 8. Would it work on x86? Maybe. I'm a bit skeptical of Zen 3 being SMT 4 to begin with though. I'd like to know the origins of that rumor.

Also, I think you need to read up a bit on SMT.

EDIT

Here is a link to some analysis of POWER with varying SMT:

https://www.anandtech.com/show/10435/assessing-ibms-power8-part-1/4

The results are later in the article. Basically, huge gains from SMT 2, everything improved using SMT 4, SMT 8 added some in certain tests.

Vattila · Jul 1, 2019

TheELF said:
You would be able to handle more threads at once but only if they used far fewer instructions each.

True. SMT4 will perform best when each thread is stalling much of the time, e.g. due to unpredictable memory accesses.

By the way, in SMT mode, the core has some resources dynamically allocated to each thread, while other resources are statically allocated. Obviously, when a single thread is running you want the core to give all resources to that thread. How does the core, and/or the OS, switch a core from single-thread to SMT mode to ensure this? Anyone know and care to explain?

Kenmitch · Jul 1, 2019

I saw this article about the IBM Power9 4c/16t and it didn't look impressive to me. Maybe the lack of real cores holds it back?

https://www.phoronix.com/scan.php?page=article&item=blackbird-power9-4c&num=1

Yeroon · Jul 1, 2019

Is there a reason they couldn't do a 3 threads per core approach vs the 4 threads per core being speculated?

Vattila · Jul 1, 2019

Yeroon said:
Is there a reason they couldn't do a 3 threads per core approach vs the 4 threads per core being speculated?

3 is not a power of 2. You need two bits, or transistors, or signal lanes, to represent the thread ID if you go to 3 or 4 threads, so you might as well go to 4, I guess.
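
A quick way to check the bit count, as a small sketch (the helper name is made up, and it only counts the bits needed for a unique thread ID; it says nothing about the many other costs of adding hardware threads):

```python
def thread_id_bits(threads):
    """Minimum number of bits needed to give each hardware thread a unique ID."""
    if threads < 1:
        raise ValueError("need at least one thread")
    # IDs run from 0 to threads - 1, so count the bits of the largest ID.
    return (threads - 1).bit_length()


for n in (2, 3, 4, 8):
    print(n, "threads ->", thread_id_bits(n), "bit(s)")
# 2 threads -> 1 bit(s)
# 3 threads -> 2 bit(s)
# 4 threads -> 2 bit(s)
# 8 threads -> 3 bit(s)
```

So 3 and 4 threads cost the same two ID bits, which is the point above.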

Yeroon · Jul 1, 2019

Vattila said:
3 is not a power of 2. You need two bits, or transistors, or signal lanes, to represent the thread ID if you go to 3 or 4 threads, so you might as well go to 4, I guess.

Thanks. That makes sense.

cherullo · Jul 1, 2019

itsmydamnation said:
Because it is. The 7-issue figure in that slide is talking about integer execution, not the total number of uops that can be dispatched to the CPU execution stage (int + fp) in one cycle.

From Anand's Zen2 analysis, page 8:

Going beyond the decoders, the micro-op queue and dispatch can feed six micro-ops per cycle into the schedulers. This is slightly imbalanced however, as AMD has independent integer and floating point schedulers: the integer scheduler can accept six micro-ops per cycle, whereas the floating point scheduler can only accept four. The dispatch can simultaneously send micro-ops to both at the same time however.

So still 6 uops dispatched, just like Zen1.
On the other hand, the more than doubling of the uop-cache allows it to provide 6 uops per cycle more often, while the apparent doubling of the branch predictor aims to prevent fetching wrong instructions. So the attained throughput in practice will be much higher.

TheELF · Jul 2, 2019

Thunder 57 said:
POWER can see benefits from SMT 4 and SMT 8. Would it work on x86? Maybe. I'm a bit skeptical of Zen 3 being SMT 4 to begin with though. I'd like to know the origins of that rumor.

Also, I think you need to read up a bit on SMT.

EDIT

Here is a link to some analysis of POWER with varying SMT:

https://www.anandtech.com/show/10435/assessing-ibms-power8-part-1/4

The results are later in the article. Basically, huge gains from SMT 2, everything improved using SMT 4, SMT 8 added some in certain tests.

What exactly should I read up on?
They are saying exactly what I said with different words.
It only makes sense if you run a huge amount of low-IPC stuff, something that is pretty much completely useless for desktop, since even a plain quad core can handle intense Windows sessions (that are not running heavy IPC loads like rendering or streaming).

If you superficially look at what kind of parallelism can be found in software, it starts to look like a suicidal move. Indeed on average, most modern CPU compute on average 2 instructions per clockcycle when running spam filtering (perlbench), video encoding (h264.ref) and protein sequence analyses (hmmer). Those are the SPEC CPU2006 integer benchmarks with the highest Instruction Per Clockcycle (IPC) rate. Server workloads are much worse: IPC of 0.8 and less are not an exception.

It is clear that simply widening a design will not bring good results, so IBM chose to run up to 8 threads simultaneously on their core. But running lots of threads is not without risk: you can end up with a throughput processor which delivers very poor performance in a wide range of applications

Vattila · Jul 2, 2019

TheELF said:
[SMT4] only makes sense if you run a huge amount of low-IPC stuff, something that is pretty much completely useless for desktop

SMT4, if implemented, will be squarely aimed at server and (some) workstation workloads. The wider core will benefit IPC for single and lightly threaded workloads on desktop. At least, that's how I see it.

PS. The SMT mode, i.e. SMT4 or SMT2, may be configurable in BIOS or a fused-off feature at manufacture. Ryzen SKUs may have SMT2 with two fat threads, while EPYC has the option of SMT4 with thinner threads.

itsmydamnation · Jul 2, 2019

cherullo said:
From Anand's Zen2 analysis, page 8:

So still 6 uops dispatched, just like Zen1.
On the other hand, the more than doubling of the uop-cache allows it to provide 6 uops per cycle more often, while the apparent doubling of the branch predictor aims to prevent fetching wrong instructions. So the attained throughput in practice will be much higher.

Could have sworn I saw something saying dispatch was up to 8 to match retirement, but I went looking and couldn't find it.

Vattila · Jul 2, 2019

itsmydamnation said:
Could have sworn I saw something saying dispatch was up to 8 to match retirement, but I went looking and couldn't find it.

Intuitively it makes sense to have some extra retirement capacity. For example, say you have the full 6 micro-ops issued to execution units and advancing into the re-order buffer (ROB) where they will be committed in order. However, the buffer may still be waiting for prior instructions to commit results. When these finally are ready to retire, the capability to retire 8 micro-ops per clock tick allows the core to catch up. However, I don't know if this is the correct understanding.
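
Here is a toy model of that catch-up idea, in case it helps. It is only a sketch under loud assumptions: every micro-op finishes the moment it enters the ROB except the very first one, which blocks in-order retirement for `stall_cycles`; the 6-wide issue and 8-wide retire figures are the ones discussed in this thread; the function name and the 192-entry default ROB size are my own choices for illustration. A real core is far more complicated.

```python
def cycles_to_retire(total_uops, issue_width, retire_width, stall_cycles,
                     rob_size=192):
    """Count cycles until all uops retire when the first uop is slow.

    Toy model: uops enter the ROB at up to issue_width per cycle (if there
    is room) and retire in order at up to retire_width per cycle, but nothing
    can retire until the slow first uop is done after stall_cycles cycles.
    rob_size is an assumed value, not taken from this thread.
    """
    issued = 0    # uops that have entered the ROB so far
    retired = 0   # uops that have left the ROB, in order
    cycle = 0
    while retired < total_uops:
        cycle += 1
        # Issue stage: limited by width, free ROB entries, and remaining work.
        room = rob_size - (issued - retired)
        issued += min(issue_width, room, total_uops - issued)
        # Retire stage: blocked while the oldest uop is still stalled.
        if cycle > stall_cycles:
            retired += min(retire_width, issued - retired)
    return cycle


# 600 uops, 6-wide issue, 20-cycle stall on the oldest uop.
print(cycles_to_retire(600, 6, 8, 20))  # 100: the backlog drains, stall hidden
print(cycles_to_retire(600, 6, 6, 20))  # 120: retirement never catches up
print(cycles_to_retire(600, 6, 6, 0))   # 100: baseline with no stall at all
```

In this model the 8-wide retirement drains the backlog by 2 micro-ops per cycle while issue keeps going, so the stall ends up costing nothing; with retirement only as wide as issue, the backlog never shrinks and the full stall shows up at the end.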

Thunder 57 · Jul 2, 2019

TheELF said:
What exactly should I read up on?
They are saying exactly what I said with different words.
It only makes sense if you run a huge amount of low-IPC stuff, something that is pretty much completely useless for desktop, since even a plain quad core can handle intense Windows sessions (that are not running heavy IPC loads like rendering or streaming).

Maybe I read it wrong, but the way you put it sounded like SMT will lower your per-thread performance by 1/2 or 1/4. That is clearly not true. SMT attempts to make use of resources that are otherwise not being used. It CAN lower performance, but that is rare and usually only by a tiny amount when it does.

naukkis · Jul 2, 2019

Thunder 57 said:
Maybe I read it wrong, but the way you put it sounded like SMT will lower your per-thread performance by 1/2 or 1/4. That is clearly not true. SMT attempts to make use of resources that are otherwise not being used. It CAN lower performance, but that is rare and usually only by a tiny amount when it does.

SMT gains are not 100 and 300 percent. With 2-thread SMT you trade one 100% speed thread for two 60% speed threads to get an overall combined speed of 120%. Individual threads will be slower when running concurrently on the same CPU core.

Markfw · Jul 2, 2019

naukkis said:
SMT gains are not 100 and 300 percent. With 2-thread SMT you trade one 100% speed thread for two 60% speed threads to get an overall combined speed of 120%. Individual threads will be slower when running concurrently on the same CPU core.

What CPU do you have ? Ryzen gets more like 130-140%

Thunder 57 · Jul 2, 2019

naukkis said:
SMT gains are not 100 and 300 percent. With 2-thread SMT you trade one 100% speed thread for two 60% speed threads to get an overall combined speed of 120%. Individual threads will be slower when running concurrently on the same CPU core.

No, they definitely are not, nor did I say so. I really think you guys need to read about SMT. There are no set percentages. It all depends on the code, and the schedulers have done fair enough work with that. I would say great work, if not for the TR 2950X / 2990WX discrepancy. Some blame memory bandwidth, but Linux seems to work fine with a 2990WX.

naukkis · Jul 2, 2019

Markfw said:
What CPU do you have ? Ryzen gets more like 130-140%

It was just an example. With SMT you get more, slower threads as the thread count increases, which combined net more throughput than without SMT. The point was, the more threads per core, the slower those individual threads will be, and it's not an ideal solution for every case. Most workloads prefer faster individual threads, so SMT just isn't a good solution for desktop and mobile oriented workloads.
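
To put numbers on that trade-off, here is a quick sketch using only the figures mentioned in this thread (120% from my example, 130-140% for Ryzen). The helper name is made up, and it assumes the combined throughput is split evenly between the threads sharing a core, which real workloads won't do exactly.

```python
def per_thread_speed(combined_throughput, threads):
    """Average speed of each SMT thread, relative to one thread alone (1.0).

    Assumes the core's combined throughput is shared evenly between threads.
    """
    return combined_throughput / threads


for combined in (1.20, 1.30, 1.40):
    speed = per_thread_speed(combined, 2)
    print(f"SMT2 at {combined:.0%} combined -> {speed:.0%} per thread")
```

Even at the higher combined figures, each of the two threads runs well below the speed of a single thread that has the core to itself.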
