On the Ethernet point: Ethernet as used with "dumb hubs" is a half-duplex protocol. Data fabrics are designed to support multiple full-duplex data paths, which is why they are used in Ethernet switches - switches are far more efficient than hubs and support Ethernet in full-duplex mode. In fact, it's impossible to use Gigabit Ethernet with dumb hubs, because in practice it doesn't *have* a half-duplex mode.
So forget about 40% of anything.
Based on stuff I've found out in the past few days, here's how Windows appears to assign threads to cores. I'm basing this on experiments with Windows 7, but I believe Win10 is essentially similar.
1: After every timeslice, the core assignment for each thread is recalculated from scratch - it is not stable. This results in frequent migration of threads between cores, unless only one core is available to that thread (of which more below) or all other available cores are already busy.
2: The timeslice is based on the thread priority, which itself is influenced by the process priority. Higher priorities get longer timeslices - "realtime" gets a (pseudo?) infinite timeslice. You can therefore reduce the thread migration frequency just by increasing the process priority.
3: Core availability is determined solely by Core Parking. The scheduler itself **is not SMT aware**, but the Core Parking algorithm is. Core Parking is not part of the scheduler - it is part of the power management subsystem, and its tunables can be made available in the Power Options control panel with some registry tweaks. I emphasise: if all cores are unparked, Windows will happily migrate threads randomly onto both physical and virtual cores, just as if they were all physical.
4: If the thread or process has CPU affinity set, that further limits the set of cores available to that thread, on top of Core Parking. However, if all of a thread's affinity cores are parked, Core Parking is overridden and the thread is assigned to one of its affinity cores anyway. Core Parking can detect this and will unpark cores that see significant affinity activity. NB: a parked core is not necessarily shut down (in a C-state) - but a parked core is *usually* idle and *therefore* usually shut down. Cores sitting in C-states are needed before the CPU will automatically boost beyond its all-core turbo speed.
5: On Windows 7, by default every physical core has at least one virtual core unparked at all times. Judging by screenshots, this might be different on Win10. There is a toggle for this behaviour.
6: The Core Parking algorithm is very dumb - it doesn't seem to know how many threads are "runnable" at any given instant - and is therefore tuned to be very liberal about how many cores it unparks. I can fully believe that it will unpark about twice as many cores as there are full-time threads. This means that a 2T workload will appear to be spread evenly across 4 cores on a many-core system - but this **must not** be taken as evidence of NUMA awareness.
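To make the interaction of these rules concrete, here's a toy Python model of the assignment behaviour described in points 1, 3 and 4. This is my own sketch, not Windows code - all the function names are mine, and it only captures the policy as I've described it:

```python
import random

def eligible_cores(affinity, unparked):
    """Cores a thread may run on: its affinity set intersected with the
    unparked set - but if every affinity core is parked, the affinity set
    wins and Core Parking is overridden (point 4)."""
    usable = affinity & unparked
    return usable if usable else affinity

def assign(current_core, affinity, unparked, busy):
    """Pick a core for the next timeslice. Per point 1, the choice is
    recalculated from scratch every time - there is no preference for the
    core the thread already occupies, so migration is likely whenever more
    than one eligible core is free."""
    free = [c for c in eligible_cores(affinity, unparked) if c not in busy]
    if not free:
        return current_core  # all eligible cores busy: stay put
    return random.choice(free)
```

For example, a thread with affinity {2, 3} whose cores are all parked still gets {2, 3} back from `eligible_cores`, exactly the override in point 4.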
In short, it's a mess. As I noted over on Reddit, Linux does all of this about 500% better, because Linux people care about technical results, not risk-averse arse-covering.
Well, now the chickens have come home to roost in Redmond, as far as technical debt in the basic scheduler is concerned. Here are a couple of things that could be fixed by one competent engineer in a week:
1: Make the core assignment algorithm stable, i.e. prefer to assign the core that the thread is already running on, or the core it last ran on (if that core isn't busy). This will have immediate performance benefits on *all* SMP, SMT, CMT, and NUMA systems, not just Ryzen: less context-switch overhead, fewer cache misses, better branch prediction, and more truly-idle cores that can be C-stated. It's pure win.
2: Rework the Core Parking algorithm to use the runnable thread count to guide the number of unparked cores. On NUMA systems, unpark cores preferentially on the same node for runnable threads belonging to the same process, and preferentially on different nodes for runnable threads in different processes. Better yet, scrap Core Parking entirely and work on the best practical implementation of Point 1.
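A minimal sketch of what Point 1 (plus the runnable-count heuristic from Point 2) might look like, again in toy Python. The names and structure are mine, and the NUMA node preference is omitted for brevity:

```python
def assign_sticky(current, last, eligible, busy):
    """Stable variant of the assignment algorithm: prefer the core the
    thread is on, then the core it last ran on, and only then migrate."""
    free = [c for c in eligible if c not in busy]
    if current in free or not free:
        return current   # stay put: no migration, warm caches
    if last in free:
        return last      # likely still-warm caches and branch predictors
    return min(free)     # migrate only as a last resort

def cores_to_unpark(runnable_threads, total_cores):
    """Let the runnable thread count guide unparking: one unparked core
    per runnable thread, capped at the number of physical cores - rather
    than unparking roughly twice as many as needed."""
    return min(runnable_threads, total_cores)
```

The sticky assignment alone gets you most of the benefit listed under Point 1: an idle thread keeps returning to the same core, and every other core stays genuinely idle and free to be C-stated.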
That, of course, is how Linux does it.