Ryzen: Strictly technical


icelight_

Junior Member
Mar 19, 2017
5
1
51
RE: AVX, that actually makes sense - the uop, retire, and store queues are statically shared but rarely used by one workload as much as they can be with SIMD. AVX, in particular, could be like issuing twice as many uops for the same task as normal, making the penalty come to light in a very clear manner.

Why would using AVX issue more uops than normal operation? At least on Intel, most AVX instructions decompose into one or maybe two uops while performing four times the work (for 4x32 vectors). For equal throughput, pressure on the uop queue and retire queue would be reduced a lot. The bottleneck would then be the load and store queues.

This could then lead to situations where both threads are stalled waiting for memory, but a single one with a larger uop queue would not be. For this to become significant, a hot path would have to cause this issue regularly, which should be noticeable by lower power consumption (and corresponding values in the perf counters).
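
To illustrate the packing (a minimal sketch of my own, not from any real codebase): one 4x32 SSE add does the work of four scalar adds, so the uop stream shrinks for the same throughput.

#include <xmmintrin.h>  /* SSE: 4 x 32-bit float lanes */

/* Scalar version: four separate add instructions' worth of uops. */
void add4_scalar(float *dst, const float *a, const float *b) {
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
}

/* Vector version: a single ADDPS does the same work, typically
   decoding to just one uop. */
void add4_sse(float *dst, const float *a, const float *b) {
    _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}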
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Is that with the latest AIDA64? And if it is, have results changed in any of the tests, aside from L3$?

The difference between the new version and the older versions of AIDA is much less than the difference between CL15 and CL16 DDR4-2667 on Ryzen. CL14 is an even larger difference.



My IPC figures are based on CL15 DDR4-2667 2T as that represents a nice middle ground.

The sensitivity to this was something I didn't expect, but it helps to explain the extreme variability around the web.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Why would using AVX issue more uops than normal operation? At least on Intel, most AVX instructions decompose into one or maybe two uops while performing four times the work (for 4x32 vectors). For equal throughput, pressure on the uop queue and retire queue would be reduced a lot. The bottleneck would then be the load and store queues.

This could then lead to situations where both threads are stalled waiting for memory, but a single one with a larger uop queue would not be. For this to become significant, a hot path would have to cause this issue regularly, which should be noticeable by lower power consumption (and corresponding values in the perf counters).

Because static partitioning in a six-issue queue means that each thread only has three uop issues per cycle when SMT is enabled, but six when it is disabled. This won't usually be much of a problem as the execution resources will be the bottleneck more often than not... BUT... when FastPath is in play...

...such as when AVX is used, you have a lot of macro-ops being split into two uops, so you can only issue two uops per cycle instead of three. When you are only fetching a handful of instructions at a time, you can only fill that empty slot so well.

At least that was my thought path.

The store queue is another bottleneck in this scenario as it is also statically partitioned.

No real way to test, though, AFAICT.
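
Rough arithmetic, under my assumption (not confirmed anywhere) that the two uops of a FastPath pair must enter the queue in the same cycle:

SMT on: 3 slots/thread/cycle -> only one 2-uop AVX op fits -> 2 of 3 slots used
SMT off: 6 slots/cycle -> three 2-uop AVX ops fit -> all 6 slots used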
 

icelight_

Junior Member
Mar 19, 2017
5
1
51
Because static partitioning in a six-issue queue means that each thread only has three uop issues per cycle when SMT is enabled, but six when it is disabled. This won't usually be much of a problem as the execution resources will be the bottleneck more often than not... BUT... when FastPath is in play...

...such as when AVX is used, you have a lot of macro-ops being split into two uops, so you can only issue two uops per cycle instead of three. When you are only fetching a handful of instructions at a time, you can only fill that empty slot so well.

At least that was my thought path.

The store queue is another bottleneck in this scenario as it is also statically partitioned.

No real way to test, though, AFAICT.
I'm not quite sure I can follow. Why wouldn't it be possible to dispatch one complete AVX op (2 uops) and half of the next one (+1 = 3 uops)? Assuming fetch/decode is fast enough, there should be enough buffered, and if not, the wider dispatch wouldn't have helped anyway?
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
I'm not quite sure I can follow. Why wouldn't it be possible to dispatch one complete AVX op (2 uops) and half of the next one (+1 = 3 uops)? Assuming fetch/decode is fast enough, there should be enough buffered, and if not, the wider dispatch wouldn't have helped anyway?

State sanity, predominantly, but that's a complicated matter that is very dependent on how FastPath is implemented - and you can bet that varies by individual instruction (dependent uops, for example, will, in effect, stay paired and are tagged for concurrent retirement - they can usually make their own way after the dispatcher, but they enter it together and retire together).

Still, even if you can issue 1.5 AVX instructions per cycle, you still have only half the instruction issue rate you'd have without SMT. 3 uops vs 6, which will certainly have a negative impact greater than, or at least equal to, that of the 8-wide retire and 22-deep store queue partitions (per thread). Ryzen writes back to the L1D at 128 bits/cycle and the L1 writes back to the L2 at twice the width, nearly maintaining the same bandwidth as the L1D, so the data should have no issue with movement after execution.
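
In raw numbers (the clock is my assumption, just for scale): 128 bits = 16 B/cycle, so at 3.5GHz that's roughly 56 GB/s from the FPU into the L1D, and roughly 112 GB/s for the 256-bit L1D-to-L2 writeback path.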

The only time I see that flushing being an issue is when working with volatile data (flushed all the way to system memory).

 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Yeah, but VP8 almost always scales with frequency more than anything else. The 4GHz quad-core 6700k beats the 3.6GHz six-core 6850k - 7521 to 6932... which is fully explainable with the frequency difference.

It's pretty strange considering the test nearly maxes out every core (16 threads at 85~90%)... but that might be explainable if the VP8 test itself is actually single-threaded and has locking overhead that slows down the Julia output, meaning performance is being restricted by locking contention... fewer threads = less contention = higher score.

RE: AVX, that actually makes sense - the uop, retire, and store queues are statically shared but rarely used by one workload as much as they can be with SIMD. AVX, in particular, could be like issuing twice as many uops for the same task as normal, making the penalty come to light in a very clear manner.
AVX 256b ops are split up in the dispatch unit. Before that, they are represented by 1 uop.
Because static partitioning in a six-issue queue means that each thread only has three uop issues per cycle when SMT is enabled, but six when it is disabled. This won't usually be much of a problem as the execution resources will be the bottleneck more often than not... BUT... when FastPath is in play...

...such as when AVX is used, you have a lot of macro-ops being split into two uops, so you can only issue two uops per cycle instead of three. When you are only fetching a handful of instructions at a time, you can only fill that empty slot so well.

At least that was my thought path.

The store queue is another bottleneck in this scenario as it is also statically partitioned.

No real way to test, though, AFAICT.
I think statically partitioned resources are vertically split, e.g. one dispatch packet for one thread per cycle, or the upper and lower halves of the PRFs per thread. Splitting between issue ports sounds strange.
 

deadhand

Junior Member
Mar 4, 2017
21
84
51
So is it due to the compiler or something else?

Seems like they were using _mm_stream intrinsics which write directly to RAM.
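
Roughly what that pattern looks like (a minimal sketch of my own, not actual AotS code; assumes 16-byte-aligned pointers and n a multiple of 4):

#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Regular stores: data travels through the cache hierarchy. */
void copy_temporal(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i += 4)
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
}

/* Non-temporal stores: MOVNTPS bypasses the caches and write-combines
   straight to RAM - useful for data you won't read back soon, but with
   performance pitfalls if that assumption is wrong. */
void copy_stream(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_load_ps(src + i));
    _mm_sfence();  /* order the streaming stores before later accesses */
}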



EDIT: I'm not going to pretend that I really know what's going on with regard to this - I've never heard of this issue before - but my theory that false dependencies cause inordinate problems for Ryzen might not have been too far off.
 
Last edited:
Reactions: CatMerc

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
Seems like they were using _mm_stream intrinsics which write directly to RAM.

http://stackoverflow.com/questions/21207023/stream-intrinsic-degrades-performance

EDIT: I'm not going to pretend that I really know what's going on with regard to this - I've never heard of this issue before - but my theory that false dependencies cause inordinate problems for Ryzen might not have been too far off.
Ha, I'm far from comprehending what's going on myself, but it seems to be a compiler thing:
Apparently the Intel compiler is smart enough to detect the access pattern and automatically generates non-temporal loads even for the temporal version.

So AotS is built with the Intel compiler? That would explain its scalability with # of cores.

Seems plausible given the comment they made on the AMD blog.

EDIT: It was actually a quote from the PCPer article.
 

deadhand

Junior Member
Mar 4, 2017
21
84
51
The linked Stack Overflow post doesn't really seem to be relevant to the Ryzen issue.

More just a run-down of what it does rather than any related performance issues. As I gather, the performance issues with this are fairly unique to Ryzen.

[removed]
 
Last edited:

icelight_

Junior Member
Mar 19, 2017
5
1
51
More just a run-down of what it does rather than any related performance issues. As I gather, the performance issues with this are fairly unique to Ryzen.

EDIT: What I'm hearing is (paraphrasing) "the issue is caused by the buffer being flooded and spilled due to pending writes not being written out fast enough, due to waiting on false dependencies"
Linking to a different performance issue with non-temporal memory access just seemed to be asking for misunderstandings, and apparently also led to one. At least I can't see anything suggesting that AotS is compiled with Intel's compiler.
I'm also interested in a source on the false dependency thing, or is that something you personally know?
 

Despoiler

Golden Member
Nov 10, 2007
1,967
772
136

kostarum

Junior Member
Sep 12, 2009
3
0
66
Can anyone tell me what the suggestions are for future Ryzen optimizations and bug fixes?

Is there anything that will help without waiting for the Zen 2 architecture?

Sent from my GT-I9295 using Tapatalk
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
There are three main areas of progress in the medium term, I think:

1: AMD and m/board vendors will continue to develop BIOS updates, among other things improving RAM compatibility and various internal timings.

2: Software can continue to eliminate small, disproportionately problematic code sequences like the one found in AotS. As I understand it, MOVNT is basically deprecated these days anyway.

3: Multithreaded software can find ways to minimise unnecessary inter-thread or inter-CCX data movement which tickles the Infinity Fabric throughput limitation, and also make the most appropriate use of SMT while avoiding its inherent drawbacks.
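
For point 3, the crude version of keeping a thread's data on one CCX is simply pinning the thread there. A Linux sketch - the CPU numbering is an assumption (I'd expect an 8-core part to enumerate CCX0 as logical CPUs 0-7 with SMT, but check /proc/cpuinfo or hwloc first):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to CCX0's cores so its working set stays in
   that CCX's L3 slice and never has to cross the Infinity Fabric. */
int pin_to_ccx0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)  /* assumed CCX0 = CPUs 0-7 */
        CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}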
 

deadhand

Junior Member
Mar 4, 2017
21
84
51
Linking to a different performance issue with non-temporal memory access just seemed to be asking for misunderstandings, and apparently also led to one. At least I can't see anything suggesting that AotS is compiled with Intel's compiler.
I'm also interested in a source on the false dependency thing, or is that something you personally know?

I don't think it really suggested that it was an issue with the Intel compiler, just that the Intel compiler does some things differently. The purpose of linking the Stack Overflow question was that it gives a bit of an explanation of how the intrinsic works (writing directly to RAM), as well as showing that there can be fairly severe performance problems with it. I thought it was somewhat relevant, but I should have found a better link.

I can remove it.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Hold on a minute - why would the dispatch bandwidth be statically partitioned? Or, perhaps, is there a source to confirm that?



Likely, though, the "vertically threaded" wording means every other line in the queue belongs to a different thread (which is more Bulldozer-like... and partly negates what I was saying in my woefully sleep-deprived state), so there's three potential instructions every cycle on average (CPU-z actually seems to hit this limit, though I haven't tried it with SMT disabled... because I'm stupid).

Not as good as six uops per cycle, of course, but most apps can only ever hope to reach 2 uops/cycle.
 
Last edited:
Reactions: Drazick

keymaster151

Junior Member
Mar 15, 2017
15
20
36
Interesting results on Ashes of the Singularity in real gameplay with the new patch here: http://www.pcgameshardware.de/Ryzen...ecials/AMD-AotS-Patch-Test-Benchmark-1224503/

While the test does show a nice increase in fps, it's still way below the 7700k, especially in min fps (about 2.5x on the 7700k). Can anyone verify these results? If they are accurate, it would mean that AotS is by far the worst-performing game on Ryzen that I've seen.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de


Likely, though, the "vertically threaded" wording means every other line in the queue belongs to a different thread (which is more Bulldozer-like... and partly negates what I was saying in my woefully sleep-deprived state), so there's three potential instructions every cycle on average (CPU-z actually seems to hit this limit, though I haven't tried it with SMT disabled... because I'm stupid).

Not as good as six uops per cycle, of course, but most apps can only ever hope to reach 2 uops/cycle.
I think that's just an ILP limitation of the code. The flops code that formerly crashed Ryzen has some paths which should go higher than 3 in ILP (e.g. FMUL+FADD) with 1T.

Mix that with int ops and you might see 4+ with 1T, and in sum with 2T.
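
A toy loop of the kind I mean (my own sketch, nothing to do with the actual flops code): several independent dependency chains per iteration, so an out-of-order core can overlap them all.

/* Two FMUL chains, two FADD chains, four integer XOR chains - none
   depends on another, so the core can run them in parallel. The ~4-cycle
   FMUL chain latency caps the loop rate, putting sustained IPC for this
   exact loop around 3; unrolling into more accumulators is what could
   push it to 4+. */
double ilp_kernel(long iters) {
    double m0 = 1.0, m1 = 1.0, a0 = 0.0, a1 = 0.0;
    long   x0 = 0,  x1 = 0,  x2 = 0,  x3 = 0;
    for (long i = 0; i < iters; i++) {
        m0 *= 1.0000001;  m1 *= 1.0000002;               /* FMUL chains */
        a0 += 0.25;       a1 += 0.5;                     /* FADD chains */
        x0 ^= i; x1 ^= i + 1; x2 ^= i + 2; x3 ^= i + 3;  /* ALU chains  */
    }
    return m0 + m1 + a0 + a1 + (double)(x0 + x1 + x2 + x3);
}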
 
Reactions: Drazick

looncraz

Senior member
Sep 12, 2011
722
1,651
136
I think that's just an ILP limitation of the code. The flops code that formerly crashed Ryzen has some paths which should go higher than 3 in ILP (e.g. FMUL+FADD) with 1T.

Mix that with int ops and you might see 4+ with 1T, and in sum with 2T.

It would seem you're right. I tested it without SMT enabled and IPC maxed at 3.19 on a single core.

That got me thinking, though, about finding a benchmark which could do better. GeekBench 3 actually manages it. Without SMT, it hits a per-thread IPC of 4.05 (average of the peak IPC seen across all threads). With SMT, it only manages 3.1.

I measured IPC with StatusCore.

This actually makes a LOT of sense given the SMT scaling.

4.05 IPC * 8 cores = 32.4 IPC (MT)
3.1 IPC * 16 threads = 49.6 IPC (SMT)

SMT peak instruction throughput improvement: 49.6 / 32.4 ≈ 1.53, i.e. 53%

I will need to hunt for something that can really push more IPC... any ideas?
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
While the test does show a nice increase in fps, it's still way below the 7700k, especially in min fps (about 2.5x on the 7700k).

But it's broadly comparable to the 6900k. Why the difference between the 8C machines and the Kaby Lake?
 