Ryzen: Strictly technical

icelight_ · Mar 30, 2017

looncraz said:
RE: AVX, that actually makes sense -the uop, retire, and store queues are statically shared but rarely used for one work load as much as they can be with SIMD. AVX, in particular, could be like issuing twice as many uops for the same task as normal, making the penalty come to light in a very clear manner.

Why would using AVX issue more uops than normal operation? At least on intel most AVX instructions decompose into one or maybe two uops, while performing 4 times (for 4x32 vectors) the work. For equal throughput, pressure on uop queue and retire queue would be reduced a lot. The bottleneck would then be the load and store queues.

This could then lead to situations where both threads are stalled, waiting for memory, but a single one with a larger uop queue would not be. For this to become significant, a hot path would have to cause this issue regularly, which should be noticable by lower power consumption (and corresponding values in the perf counters).

looncraz · Mar 30, 2017

imported_jjj said:
Is that with the latest AIDA64? And if it is , have results changed in any of the tests, aside from L3$.

The difference between the new version and the older versions of AIDA is much less than the difference between CL15 and CL16 DDR4-2667 on Ryzen. CL14 is an even larger difference.

My IPC figures are based on CL15 DDR4-2667 2T as that represents a nice middle ground.

The sensitivity to this was something I didn't expect, but it helps to explain the extreme variability around the web.

looncraz · Mar 30, 2017

icelight_ said:
Why would using AVX issue more uops than normal operation? At least on intel most AVX instructions decompose into one or maybe two uops, while performing 4 times (for 4x32 vectors) the work. For equal throughput, pressure on uop queue and retire queue would be reduced a lot. The bottleneck would then be the load and store queues.

This could then lead to situations where both threads are stalled, waiting for memory, but a single one with a larger uop queue would not be. For this to become significant, a hot path would have to cause this issue regularly, which should be noticeable by lower power consumption (and corresponding values in the perf counters).

Because static partitioning in a six-issue queue means that each thread only haa three uop issues per cycle when ST is enabled, but six when it is disabled. This won't usually be much of a problem as the execution resources will be the bottleneck more often than not... BUT... when FastPath is in play...

...such as when AVX is used, you have a lot of macro-ops being split into two uops you can only issue two uops per cycle, instead of three uops. When you are only fetching a handful of instructions at a time, you can only fill that empty slot so well.

At least that was my thought path.

The store queue is another bottleneck in this scenario as it is also statically partitioned.

No way real way to test, though, AFAICT.

icelight_ · Mar 30, 2017

looncraz said:
Because static partitioning in a six-issue queue means that each thread only haa three uop issues per cycle when ST is enabled, but six when it is disabled. This won't usually be much of a problem as the execution resources will be the bottleneck more often than not... BUT... when FastPath is in play...

...such as when AVX is used, you have a lot of macro-ops being split into two uops you can only issue two uops per cycle, instead of three uops. When you are only fetching a handful of instructions at a time, you can only fill that empty slot so well.

At least that was my thought path.

The store queue is another bottleneck in this scenario as it is also statically partitioned.

No way real way to test, though, AFAICT.

I'm not quite sure i can follow. Why wouldn't it be possible to dispatch one complete avx op (2 uops) and half of the next one (+ 1 = 3 uops)? Assuming fetch/decode is fast enough, there should be enough buffered, and if not, the wider dispatch wouldn't have helped anyway?

looncraz · Mar 30, 2017

icelight_ said:
I'm not quite sure i can follow. Why wouldn't it be possible to dispatch one complete avx op (2 uops) and half of the next one (+ 1 = 3 uops)? Assuming fetch/decode is fast enough, there should be enough buffered, and if not, the wider dispatch wouldn't have helped anyway?

State sanity, predominantly, but that's a complicated matter that is very dependent on how the FastPath is implemented - and you can bet that varies by each individual instruction (dependent uops, for example, will, in effect, stay paired and are tagged for concurrent retirement - they can usually make their way however after the dispatcher, but they enter it together and retire together).

Still, even if you can issue 1.5 AVX instructions per cycle, you still have only half the instruction issue rate as without SMT. 3 uops vs 6, which will certainly have a negative impact greater than, or at least equal to, the 8-wide retire and 22-deep store queue partition (per thread). Ryzen writes back to the L1D 128-bits/cycle and the L1 writes back to the L2 with twice the width, nearly maintaining the same bandwidth as the L1D, so the data should have no issue with movement after execution.

The only time I see that flushing being an issue is when working with volatile data (flushed all the way to system memory).

Kromaatikse · Mar 30, 2017

Hold on a minute - why would the dispatch bandwidth be statically partitioned? Or, perhaps, is there a source to confirm that?

Dresdenboy · Mar 30, 2017

looncraz said:
Yeah, but VP8 almost always scales with frequency more than anything else. The 4Ghz quad-core 6700k beats the 3.6GHz six-core 6850k - 7521 to 6932... which is fully explainable with the frequency difference.

It's pretty strange considering the test nearly maxes out every core (16 threads at 85~90%)... but that might be explainable if the VP8 test itself is actually single threaded and it has locking overhead that slows down the Julia output, meaning performance is being restricted by locking contention... less threads = less contention = higher score.

RE: AVX, that actually makes sense -the uop, retire, and store queues are statically shared but rarely used for one work load as much as they can be with SIMD. AVX, in particular, could be like issuing twice as many uops for the same task as normal, making the penalty come to light in a very clear manner.

AVX 256b uops are split up in the dispatch unit. Before they are represented by 1 uop.

looncraz said:
Because static partitioning in a six-issue queue means that each thread only haa three uop issues per cycle when ST is enabled, but six when it is disabled. This won't usually be much of a problem as the execution resources will be the bottleneck more often than not... BUT... when FastPath is in play...

...such as when AVX is used, you have a lot of macro-ops being split into two uops you can only issue two uops per cycle, instead of three uops. When you are only fetching a handful of instructions at a time, you can only fill that empty slot so well.

At least that was my thought path.

The store queue is another bottleneck in this scenario as it is also statically partitioned.

No way real way to test, though, AFAICT.

I think statically partitioned resources are vertically split, e.g. one dispatch packet for one thread per cycle, or upper and lower half of the PRFs per thread. Splitting between issue ports sounds strange.

deadhand · Mar 30, 2017

https://twitter.com/FioraAeterna/status/847472586581712897
https://twitter.com/FioraAeterna/status/847472836344033280

Changes that improved performance in Ashes were not related to thread scheduling, but rather fixing false dependencies caused by the use of an SSE instruction, apparently.

lobz · Mar 30, 2017

coercitiv said:
I was only talking from an image quality point of view.

how dare you..!!!11

tamz_msc · Mar 30, 2017

deadhand said:
https://twitter.com/FioraAeterna/status/847472586581712897
https://twitter.com/FioraAeterna/status/847472836344033280

Changes that improved performance in Ashes were not related to thread scheduling, but rather fixing false dependencies caused by the use of an SSE instruction, apparently.

So is it due to the compiler or something else?

deadhand · Mar 30, 2017

tamz_msc said:
So is it due to the compiler or something else?

Seems like they were using _mm_stream intrinsics which write directly to RAM.

EDIT: I'm not going to pretend that I really know what's going on with regards to this, I've never heard of this issue before, but my theory that false dependencies cause inordinate problems for Ryzen might not have been too far off.

tamz_msc · Mar 30, 2017

deadhand said:
Seems like they were using _mm_stream intrinsics which write directly to RAM.

http://stackoverflow.com/questions/21207023/stream-intrinsic-degrades-performance

EDIT: I'm not going to pretend that I really know what's going on with regards to this, I've never heard of this issue before, but my theory that false dependencies cause inordinate problems for Ryzen might not have been too far off.

Ha, I'm far from comprehending what's going on myself, but it seems to be a compiler thing:

Apparently the Intel compiler is smart enough to detect the access pattern and automatically generates non-temporal loads even for the temporal version.

So AotS is built with the Intel compiler? That would explain its scalability with #of cores.

Seems plausible given the comment they made on the AMD blog.

EDIT: It was actually a quote from the PCPer article.

icelight_ · Mar 30, 2017

The linked stackoverflow post doesn't really seem to be relevant to the ryzen issue.

deadhand · Mar 30, 2017

icelight_ said:
The linked stackoverflow post doesn't really seem to be relevant to the ryzen issue.

More just a run-down of what it does rather than any related performance issues. As I gather the performance issues with this is fairly unique to Ryzen.

[removed]

icelight_ · Mar 30, 2017

deadhand said:
More just a run-down of what it does rather than any related performance issues. As I gather the performance issues with this is fairly unique to Ryzen.

EDIT: What I'm hearing is (paraphrasing) "the issue is caused by the buffer being flooded and spilled due to pending writes not being written out fast enough, due to waiting on false dependencies"

Linking to a different performance issue with non-temporal memory access just seemed to be asking for misunderstandings, and apparently also led to one. At least i can't see anything suggesting that AotS is compiled with intel's compiler.
I'm also interested in a source on the false dependency thing, or is that something you personally know?

Despoiler · Mar 30, 2017

The programmer tells us what the issue was. MOVNT are SSE instructions. Non-temporals are things that run and produce data, but are not immediately consumed. Their code was flushing the entire chip which made Ryzen run way slow.

The top answer gives a good explanation with a link to the programmer's guide
http://stackoverflow.com/questions/37070/what-is-the-meaning-of-non-temporal-memory-accesses-in-x86

kostarum · Mar 30, 2017

Can anyone tell me what are suggestions for future Ryzen optimizations and bug fixing?

Is there anything that will help without to wait for Zen 2 architecture?

Gesendet von meinem GT-I9295 mit Tapatalk

Kromaatikse · Mar 30, 2017

There are three main areas of progress in the medium term, I think:

1: AMD and m/board vendors will continue to develop BIOS updates, among other things improving RAM compatibility and various internal timings.

2: Software can continue to eliminate small, disproportionately problematic code sequences like the one found in AotS. As I understand it, MOVNT is basically deprecated these days anyway.

3: Multithreaded software can find ways to minimise unnecessary inter-thread or inter-CCX data movement which tickles the Infinity Fabric throughput limitation, and also make the most appropriate use of SMT while avoiding its inherent drawbacks.

deadhand · Mar 30, 2017

icelight_ said:
Linking to a different performance issue with non-temporal memory access just seemed to be asking for misunderstandings, and apparently also led to one. At least i can't see anything suggesting that AotS is compiled with intel's compiler.
I'm also interested in a source on the false dependency thing, or is that something you personally know?

I don't think it really suggested that it was an issue with the Intel compiler, just that the Intel compiler does some things differently. The purpose of linking the stack overflow question was that it gives a bit of an explanation of how it works (writing directly to ram), as well as that there can be fairly severe performance problems with it. I thought it was somewhat relevant, but I should have found a better link.

I can remove it.

looncraz · Mar 30, 2017

Kromaatikse said:
Hold on a minute - why would the dispatch bandwidth be statically partitioned? Or, perhaps, is there a source to confirm that?

Likely, though, the "vertically threaded" words means every other line in the queue belongs to a different thread (which is more Bulldozer-like... and partly negates what I was saying in my woefully sleep deprived state), so there's three potential instructions every cycle on average (CPU-z actually seems to hit this limit, though I haven't tried it with SMT disabled... because I'm stupid ).

Not as good as six uops per cycle, of course, but most apps can only ever hope to reach 2uops/cycle.

keymaster151 · Mar 31, 2017

Interesting results on Ashes of the Singularity in real gameplay with the new patch here: http://www.pcgameshardware.de/Ryzen...ecials/AMD-AotS-Patch-Test-Benchmark-1224503/

While the test does show a nice increase in fps, it's still way below the 7700k, especially in min fps (about 2.5x on the 7700k.) Can anyone test these results? If they are accurate, it would mean that AotS is by far the worst performing game on Ryzen that I've seen.

looncraz · Mar 31, 2017

Disregard

Dresdenboy · Mar 31, 2017

looncraz said:
Likely, though, the "vertically threaded" words means every other line in the queue belongs to a different thread (which is more Bulldozer-like... and partly negates what I was saying in my woefully sleep deprived state), so there's three potential instructions every cycle on average (CPU-z actually seems to hit this limit, though I haven't tried it with SMT disabled... because I'm stupid ).

Not as good as six uops per cycle, of course, but most apps can only ever hope to reach 2uops/cycle.

I think that's just an ILP limitation of the code. The formerly Ryzen crashing flops code has some paths which should go higher than 3 in ILP (e.g. FMUL+FADD) with 1T.

Mix that with Int ops and you might see 4+ with 1T and in sum with 2T.

looncraz · Mar 31, 2017

Dresdenboy said:
I think that's just an ILP limitation of the code. The formerly Ryzen crashing flops code has some paths which should go higher than 3 in ILP (e.g. FMUL+FADD) with 1T.

Mix that with Int ops and you might see 4+ with 1T and in sum with 2T.

It would seem you're right. I tested it without SMT enabled and IPC maxed at 3.19 on a single core.

That got me to thinking, though, of finding a benchmark which could do better. GeekBench 3 actually manages it. Without SMT, it hits a per-thread IPC of 4.05 (average of the peak IPC seen across all threads). With SMT, it only manages 3.1.

I measured IPC with StatusCore.

This actually makes a LOT of sense given the SMT scaling.

4.05 IPC * 8 Cores = 32.4 IPC (MT)
3.1 IPC * 16 Cores = 49.6 IPC (SMT)

SMT peak instruction throughput improvement: 53%

I will need to hunt for something that can really push more IPC... any ideas?

Atari2600 · Mar 31, 2017

keymaster151 said:
While the test does show a nice increase in fps, it's still way below the 7700k, especially in min fps (about 2.5x on the 7700k.)

But its broadly comparable to the 6900k. Why the difference between the 8C machines and the Kaby Lake?

Ryzen: Strictly technical

Junior Member

Senior member

Senior member

Junior Member

Senior member

Member

Golden Member

Junior Member

Platinum Member

Diamond Member

Junior Member

Diamond Member

Junior Member

Junior Member

Junior Member

Golden Member

Junior Member

Member

Junior Member

Senior member

Junior Member

Senior member

Golden Member

Senior member

Golden Member