Ryzen: Strictly technical


JimmiG

Platinum Member
Feb 24, 2005
2,024
112
106
This is why some people are seeing a sudden improvement in a few cases with the new Windows update - core parking was re-enabled and they didn't notice.

This. I got down-voted on r/AMD for pointing out that there's nothing in the KB article suggesting anything was changed with regards to the scheduler and Ryzen. People there are convinced it's some giant conspiracy and Microsoft and AMD are too afraid of stepping on each other's toes to mention any such improvements. So Microsoft implemented the Ryzen "fixes" in secret and didn't mention it in the KB article.

Also, I wonder if Ryzen would have performed better in the game benchmarks if reviewers had used Balanced mode with the appropriate tweaks (especially minimum CPU at 100% to prevent latency issues due to Windows trying to manage the clock speed) instead of High Performance. This would have resulted in higher clock speeds in lightly threaded applications due to XFR, and potentially better thread management.
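
If anyone wants to try that without switching plans, the minimum processor state on the Balanced plan can be raised from an elevated prompt; a minimal sketch that just shells out to powercfg (SCHEME_CURRENT, SUB_PROCESSOR and PROCTHROTTLEMIN are the documented aliases for the active plan, the processor subgroup and "minimum processor state"):

Code:
// Hedged sketch: apply the Balanced-plan tweak described above by shelling
// out to powercfg. The same commands can of course be run by hand from an
// elevated prompt, or set in the Power Options control panel.
#include <cstdlib>

int main() {
    // Set minimum processor state to 100% on AC and DC for the active plan.
    std::system("powercfg /setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMIN 100");
    std::system("powercfg /setdcvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMIN 100");
    // Re-apply the active scheme so the change takes effect immediately.
    std::system("powercfg /setactive SCHEME_CURRENT");
}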
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
This. I got down-voted on r/AMD for pointing out that there's nothing in the KB article suggesting anything was changed with regards to the scheduler and Ryzen.

Something in that giant KB4013429 changed my system. Bigly. I ran the Game DVR after updating, and it totally honked my computer. It recorded my replay just fine, just like it has recorded dozens of other replays in the past. But after I stopped the recording, things got really weird. There was a large delay for everything I clicked to highlight. In Windows Explorer, or on the desktop, when you click on a file the icon or filename highlights instantly; even on slow machines it is pretty much instant. But after my Game DVR session, every single icon click took 2-3 seconds for the icon to actually highlight. Yet Task Manager showed 99% idle. I've never seen anything like this. There was nothing running bogging my system down. Only a reboot fixed it.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
This. I got down-voted on r/AMD for pointing out that there's nothing in the KB article suggesting anything was changed with regards to the scheduler and Ryzen. People there are convinced it's some giant conspiracy and Microsoft and AMD are too afraid of stepping on each other's toes to mention any such improvements. So Microsoft implemented the Ryzen "fixes" in secret and didn't mention it in the KB article.

Also, I wonder if Ryzen would have performed better in the game benchmarks if reviewers had used Balanced mode with the appropriate tweaks (especially minimum CPU at 100% to prevent latency issues due to Windows trying to manage the clock speed) instead of High Performance. This would have resulted in higher clock speeds in lightly threaded applications due to XFR, and potentially better thread management.

On my system, 2 days after the first update, there was another little update that did not require reboot...
 

innociv

Member
Jun 7, 2011
54
20
76
UMS is exactly what I'm talking about.

It's been verified that many games see Ryzen as a 16 core. As in, they are not seeing SMT and are instructing the scheduler accordingly.
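
For what it's worth, the topology is queryable through the documented API; a minimal sketch (not from any particular game) that counts physical cores and logical processors with GetLogicalProcessorInformationEx - an 8C/16T Ryzen with SMT exposed should report 8 and 16:

Code:
// Sketch: count physical cores and logical processors via the documented
// GetLogicalProcessorInformationEx API. This is how software can tell an
// 8C/16T part from 16 real cores.
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &len);
    std::vector<char> buf(len);
    if (!GetLogicalProcessorInformationEx(
            RelationProcessorCore,
            reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data()),
            &len)) {
        return 1;
    }

    int cores = 0, logical = 0;
    for (DWORD off = 0; off < len;) {
        auto* info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data() + off);
        ++cores;
        // Count set bits in the core's group affinity mask(s) = logical CPUs on that core.
        for (WORD g = 0; g < info->Processor.GroupCount; ++g) {
            KAFFINITY m = info->Processor.GroupMask[g].Mask;
            while (m) { logical += static_cast<int>(m & 1); m >>= 1; }
        }
        off += info->Size;
    }
    std::printf("%d physical cores, %d logical processors\n", cores, logical);
}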
 

Mockingbird

Senior member
Feb 12, 2017
733
741
106
UMS is exactly what I'm talking about.

It's been verified that many games see Ryzen as a 16 core. As in, they are not seeing SMT and are instructing the scheduler accordingly.

That's not a big issue.

One can simply go into the BIOS and turn off SMT.
 

mtcn77

Member
Feb 25, 2017
105
22
91
Is there a consensus yet on enabling HPET?
Ryzen master needs HPET enabled.
HPET actually slows down the OS timer, checked via DPC Latency Checker v1.3.0. I'm using Hypermatrix's SetTimerResolution utility to alternate between the variables.
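
For anyone who wants to poke at this themselves, the multimedia timer API is the documented way to query and raise the system timer resolution; a minimal sketch (timer-resolution utilities typically use the native NT call instead, so this is only an approximation of what they do):

Code:
// Sketch: query the multimedia timer's supported range and temporarily raise
// the system timer resolution to 1 ms while doing latency-sensitive work.
#include <windows.h>
#include <timeapi.h>
#include <cstdio>
#pragma comment(lib, "winmm.lib")

int main() {
    TIMECAPS tc;
    if (timeGetDevCaps(&tc, sizeof(tc)) == MMSYSERR_NOERROR) {
        std::printf("timer period range: %u ms .. %u ms\n", tc.wPeriodMin, tc.wPeriodMax);
    }

    timeBeginPeriod(1);   // request 1 ms scheduling/timer granularity
    Sleep(2000);          // ... do latency-sensitive work here ...
    timeEndPeriod(1);     // always pair with timeEndPeriod
}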
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Is there a consensus yet on enabling HPET?
Ryzen master needs HPET enabled.

Intuitively, I would leave HPET enabled. It fixes the "sleep clock bug" when overclocking if nothing else, and is a stable, reliable, high-resolution time source. I haven't heard of any serious problems being caused by it.
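
A quick, hedged way to see what is backing the high-resolution counter on a given box is to print the QueryPerformanceCounter frequency - on recent Windows it is commonly around 10 MHz when the invariant TSC is used and about 14.32 MHz when the platform clock (HPET) is forced with bcdedit /set useplatformclock true; treat those values as typical observations rather than a guarantee:

Code:
// Sketch: print the QueryPerformanceCounter frequency. The value hints at
// which hardware timer backs QPC on this particular system (see note above).
#include <windows.h>
#include <cstdio>

int main() {
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   // always succeeds on XP and later
    std::printf("QPC frequency: %lld Hz\n", static_cast<long long>(freq.QuadPart));
}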
 

ryzenmaster

Member
Mar 19, 2017
40
89
61
So the scheduler saga continues. This time I wanted to try something different and see just how bad the thread migration issue is. As it turns out, there is a way to programmatically find out which core your thread is running on (see https://msdn.microsoft.com/en-us/library/ms683181(v=vs.85).aspx). Calling it every iteration would be quite excessive, so instead I added a condition so that it only gets called when there was too big a difference in latency compared to the previous iteration. Maybe not perfect, but the idea is that a big difference between iterations could be explained by a context switch.
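
For reference, that check can be wired up roughly like this; a minimal sketch assuming the Win32 call from the MSDN link above (GetCurrentProcessorNumber) and a dummy spin loop standing in for the actual tree traversal:

Code:
// Minimal sketch: time each iteration of a dummy workload and, when the
// latency jumps well above the previous iteration, record which logical
// processor the thread is currently on. The visited-core bitmask mirrors
// the "Cores:" strings in the results below.
#include <windows.h>
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;
    std::uint64_t visited = 1ull << GetCurrentProcessorNumber();  // record the starting core
    volatile std::uint64_t sink = 0;   // keeps the loop from being optimized away
    long long prev_ns = 0;
    const auto deadline = clock::now() + std::chrono::seconds(60);

    while (clock::now() < deadline) {
        const auto t0 = clock::now();
        for (int i = 0; i < 10000; ++i) sink = sink + i;   // stand-in for the tree lookup
        const long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - t0).count();

        // Only query the processor number when the latency spikes versus the
        // previous iteration (here: more than 2x), as described above.
        if (prev_ns != 0 && ns > 2 * prev_ns)
            visited |= 1ull << GetCurrentProcessorNumber();
        prev_ns = ns;
    }

    std::printf("visited-core bitmask: 0x%016llx\n",
                static_cast<unsigned long long>(visited));
}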

I also changed from a fixed number of iterations to a timed run. Now each thread runs for 60 seconds and at the end reports its average latency as well as all the cores it visited during the run. To allow migration, each run was done without affinity. The first run was with 4 threads, each referencing one and the same tree instance.

Here are the results:

[Thread: 1] Avg: 495ns
[Thread: 1] Cores: 0000000000000001

[Thread: 2] Avg: 483ns
[Thread: 2] Cores: 1000010000001000

[Thread: 3] Avg: 483ns
[Thread: 3] Cores: 0010000000110000

[Thread: 4] Avg: 491ns
[Thread: 4] Cores: 0000100011000000

Cores are numbered 0-15 and a value of 1 means the core was visited by the thread, whereas 0 means it wasn't. So during the 60-second run every thread, with the exception of #1, got bounced between cores, and in fact between CCXs as well. One shortcoming is that I'm not collecting metrics on how long was spent on each core or how frequently migration occurs. The main point, however, is to verify that cross-CCX migration does happen.

Next up is a slightly different setup with 4 smaller 1MB instances of the tree structure containing a dictionary. Each thread this time had a reference to a different instance, so we have 1:1 cardinality. In this case it shouldn't make any difference which CCX the threads get assigned to, as long as they stay there.

Here are the results:

[Thread: 1] Avg: 231ns
[Thread: 1] Cores: 0000101000100000

[Thread: 2] Avg: 232ns
[Thread: 2] Cores: 1010000000000001

[Thread: 3] Avg: 230ns
[Thread: 3] Cores: 0100000010000000

[Thread: 4] Avg: 231ns
[Thread: 4] Cores: 0000000001100000

So again we can see that all but one thread were at some point scheduled on a different CCX. The overall lower latencies can be explained by the smaller data set and hence a smaller tree to traverse. Again, there's no clue here as to how frequently migration occurs; if it only happens once or so during the run, it would have very little effect on the averages over a 60-second run. Indeed, if we look at the same scenario with 0,2,4,6 affinity (one way to set that affinity is sketched after these results), we can see that there is pretty much no difference:

[Thread: 1] Avg: 228ns
[Thread: 1] Cores: 1000000000000000

[Thread: 2] Avg: 231ns
[Thread: 2] Cores: 0010000000000000

[Thread: 3] Avg: 225ns
[Thread: 3] Cores: 0000100000000000

[Thread: 4] Avg: 231ns
[Thread: 4] Cores: 0000001000000000
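
One way to pin the four workers to cores 0, 2, 4 and 6 - not necessarily how the benchmark above did it - is SetThreadAffinityMask, roughly like so:

Code:
// Hedged sketch: pin four worker threads to logical processors 0, 2, 4, 6.
// The actual benchmark may have used a different mechanism (e.g. start /affinity).
#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

void worker(int id, DWORD_PTR cpu) {
    // Restrict this thread to exactly one logical processor before doing work.
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << cpu);
    // ... the timed benchmark loop would go here ...
    std::printf("[Thread: %d] running on CPU %lu\n", id, GetCurrentProcessorNumber());
}

int main() {
    const DWORD_PTR cpus[4] = {0, 2, 4, 6};
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i + 1, cpus[i]);
    for (auto& t : threads) t.join();
}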

Finally, for the last scenario I had only one thread, which was assigned the larger 3MB data set. During these runs I had no other active software running aside from Task Manager, and all other cores were sitting idle with CPU usage at 1%. My benchmarking software runs a continuous loop without any blocking operations, locking, thread sleeping, I/O or any other operation which might encourage a context switch, so there should be little to no reason to interrupt my loop during the run. If any other operations need to be scheduled, they could be scheduled on any of the idle cores.

So let's see what happened over 3 runs, each with only one thread:

[Thread: 1] Avg: 311ns
[Thread: 1] Cores: 0000000100000010

[Thread: 1] Avg: 309ns
[Thread: 1] Cores: 0000000000010000

[Thread: 1] Avg: 312ns
[Thread: 1] Cores: 0001000000000010

In 2 of the 3 runs our thread not only got scheduled on different cores, it also got bounced between the two CCXs. Based on what I saw in Task Manager, I believe this migration happened only once during each run, so it should hardly even show up in the averages.

In conclusion, if there was still any doubt as to whether the Windows 10 scheduler needlessly bounces threads between CCXs, I can now say with a high degree of certainty: it does!

Q: "But wait a minute.. aren't your results suggesting that there is very little latency penalty in single threaded applications even if thread migration is real?"

A: Well, again, my scenario gives little to no reason for context switching to occur in the first place. Something like a single-threaded game is much more complex, so these results do not translate well to such scenarios. This is just one specific case, but it does suffice to substantiate that thread migration is real.

Also, a bonus factoid: over a 5-minute run with 4 threads, each thread visited every single physical and logical core at some point during the run.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
So, NewEgg and FedEx came through and got me an ASUS Crosshair VI Hero and the ASRock Fatalwhat? AB350 Gaming K4 delivered on the same day. I had a strange issue with the ASUS, which appears to have just been the mounting pressure on the CPU, since I went ahead and mounted it in my Enthoo Luxe case with the waterblock, so the pressure is set by thumbscrews.

This is a downside to those short pins - if you apply too much pressure and the board warps even a little, the pins make poor or no contact with the socket, as the socket itself will distort.

I will be running some experiments, including trying to get NUMA working - since these CPUs frankly need to be configured as two NUMA nodes - accessing data between the CCXes is the same as going back out to main memory... which makes it not much different than normal dual socket configurations.
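
A quick way to check whether Windows actually exposes more than one node after any BIOS/ACPI experiments is GetNumaHighestNodeNumber; a minimal sketch (stock Ryzen desktop platforms are expected to report a single node):

Code:
// Sketch: ask Windows how many NUMA nodes it sees. A hypothetical two-node
// (per-CCX) configuration would show up here.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highest = 0;
    if (GetNumaHighestNodeNumber(&highest)) {
        std::printf("NUMA nodes visible to Windows: %lu\n", highest + 1);
    }
}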

I am running stability testing and validating that benchmarks haven't changed, will post anything interesting as it arises.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
which makes it not much different than normal dual socket configurations.
Is it? Memory access is still uniform; it is more of a Skulltrail than a NUMA config, now that I consider it. In fact, it is literally a Skulltrail config, just on a single die: 2 quad-cores with plenty of cache for them, connected to each other, the memory and all the I/O by a bus.
 

powerrush

Junior Member
Aug 18, 2016
20
4
41
Is it? Memory access is still uniform; it is more of a Skulltrail than a NUMA config, now that I consider it. In fact, it is literally a Skulltrail config, just on a single die: 2 quad-cores with plenty of cache for them, connected to each other, the memory and all the I/O by a bus.
A bus that is clocked at the speed of ram... it is a fail
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,864
3,417
136
I wonder why it's not possible to overclock it, independently of memory.
I wonder why people think it needs to be faster? There is a limit to the bandwidth of the interface between the ccx and the rest of the soc. There is also no indication that exclusive lines in the l3 can't just be copied across.
 

powerrush

Junior Member
Aug 18, 2016
20
4
41
I wonder why people think it needs to be faster? There is a limit to the bandwidth of the interface between the ccx and the rest of the soc. There is also no indication that exclusive lines in the l3 can't just be copied across.

The interface is the Coherent Data Fabric. L3 bandwidth is many GB/s above DDR4 bandwidth, so if I don't want a bottleneck between the cores, I must consider a unified L3 cache across the 8 cores or an accelerated bus between the core complexes.
 
Reactions: Malogeek

powerrush

Junior Member
Aug 18, 2016
20
4
41
Look, Ryzen is a beast of a CPU; the IPC must be better than Intel's, but it is dropping many cycles per second. That is the reason why in heavy multithreaded tasks it performs like an i7-6900K. In heavy multitasking the threads don't cycle between core complexes, or I should say they don't cross the laggy zone.
 
Reactions: looncraz

itsmydamnation

Platinum Member
Feb 6, 2011
2,864
3,417
136
The interface is the Coherent Data Fabric. L3 bandwidth is many GB/s above DDR4 bandwidth, so if I don't want a bottleneck between the cores, I must consider a unified L3 cache across the 8 cores or an accelerated bus between the core complexes.
You are presenting guesses as fact. What clock rate does the CCX-to-fabric interface run at?

Why are you talking about a unified cache? That has nothing to do with Zen, or with how CCXs are used to create scalable solutions.

The first problem is that no one here knows how the cache coherency protocol in Zen works, or any directories/snoop filters, or how they handle any of the states outside of the exclusive state.

So you just run to bandwidth, because that's the easy thing to blame...
 
Reactions: Ajay

MajinCry

Platinum Member
Jul 28, 2015
2,495
571
136
I wonder why people think it needs to be faster? There is a limit to the bandwidth of the interface between the ccx and the rest of the soc. There is also no indication that exclusive lines in the l3 can't just be copied across.

There's quite a big boost in performance when threads are split across CCXs once you chuck in high-speed RAM.

The first 2x CCX result is with 2133MHz RAM. The second is with 3200MHz. With the slower RAM, the draw call efficiency is better than Phenom II/Piledriver, but worse than Core 2. With the faster RAM, it performs at Core 2 levels. That's 600MHz (actual speed difference) faster RAM.

I'd really like to see how things fare if we were able to clock the DF past what our RAM is rated for.
 

powerrush

Junior Member
Aug 18, 2016
20
4
41
You are presenting guesses as fact. What clock rate does the CCX-to-fabric interface run at?

Why are you talking about a unified cache? That has nothing to do with Zen, or with how CCXs are used to create scalable solutions.

The first problem is that no one here knows how the cache coherency protocol in Zen works, or any directories/snoop filters, or how they handle any of the states outside of the exclusive state.

So you just run to bandwidth, because that's the easy thing to blame...

Maybe you can explain better why Ryzen is lagging behind Intel's IPC in lightly threaded tasks? I'll wait...
 

hamunaptra

Senior member
May 24, 2005
929
0
71
In conclusion, if there was still any doubt as to whether the Windows 10 scheduler needlessly bounces threads between CCXs, I can now say with a high degree of certainty: it does!

Also, a bonus factoid: over a 5-minute run with 4 threads, each thread visited every single physical and logical core at some point during the run.


I'm curious, wouldn't the quantum timing of threads affect threads bouncing around?
I know there are thousands of events (things the scheduler handles and dispatches) going on in any given second, such as interrupt handling, etc.
Maybe the scheduler is scheduling them in such a manner that these events have higher priority and is spreading them across all cores, thus preempting your test threads' quantums and causing the resultant context switch of the thread to another core for its next quantum?
I think even equal-priority threads are allowed to run their full quantum, but then the next thread of the same priority runs its quantum... each causing a context switch and a potential reschedule on a different core.
That's my guess as to why threads get bounced around so much =-)
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
In conclusion, if there was still any doubt as to whether the Windows 10 scheduler needlessly bounces threads between CCXs, I can now say with a high degree of certainty: it does!
Is this result with Core Parking on? If so, how does it change with it off? (Or vice versa.)
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
I'm curious, wouldn't the quantum timing of threads affect threads bouncing around?
I know there are thousands of events (things the scheduler handles and dispatches) going on in any given second, such as interrupt handling, etc.
Maybe the scheduler is scheduling them in such a manner that these events have higher priority and is spreading them across all cores, thus preempting your test threads' quantums and causing the resultant context switch of the thread to another core for its next quantum?
I think even equal-priority threads are allowed to run their full quantum, but then the next thread of the same priority runs its quantum... each causing a context switch and a potential reschedule on a different core.
That's my guess as to why threads get bounced around so much =-)

I have a fair idea of what's going on in the scheduler itself.

What you have to understand is that there is a single data structure across all CPUs (maybe all CPUs in the same group, in a multi-group setup), known as the "multilevel priority queue". The scheduler algorithm itself operates on all cores individually, communicating only via this single data structure. The core-parking mask is largely independent of this algorithm, and is maintained by the power-management subsystem which, I believe, operates on its own process or thread or interrupt (or something).

When a thread's timeslice expires, a timer interrupt fires and causes the thread to be suspended in favour of the scheduler algorithm. This in turn adds the thread to the appropriate level of the queue. The scheduler then looks at the queue to see which thread it should run next, which might be the same thread or a different one.

The twist is that the timer interrupt fires on *all* unparked cores simultaneously, so the scheduler algorithm is running *concurrently* on all those CPU cores. There is a mutex (or, equivalently, a lock-free algorithm) to prevent this from corrupting the queue, but the net result is a race condition with respect to exactly which core gets to pick up which thread. Especially if there is only one thread, the exact core that picks it up is random within the unparked set.

In short, the Windows scheduler is non-deterministic. This is a general property of concurrent algorithms containing race conditions.

In theory, there is a straightforward fix - or rather, two equivalently good alternative fixes which are both straightforward. Microsoft would have absolutely no trouble implementing either or both of them if it wanted to.
  • Have the scheduler timer interrupts fire in a staggered fashion, so that the algorithm (usually) does not run concurrently. This greatly increases the chance that the thread will stay on its original core when appropriate.
  • Hold the queue mutex continuously between inserting the old thread into the queue and extracting the new one. This guarantees that a single thread will remain on its original core when appropriate.
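
To make the race concrete, here is a toy model in plain C++ (not Windows internals): several "cores" wake on the same simulated tick and contend for a single ready thread through a shared, locked queue, and which core wins differs from run to run. Holding the lock across both the re-insert and the pick, as in the second fix above, is what would make the placement deterministic again.

Code:
// Toy illustration only: four "cores" wake on the same simulated timer tick
// and race to pull the single ready thread off a shared queue. Which core
// picks it up varies between runs, mimicking the non-deterministic placement
// described above.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    std::mutex queue_lock;
    std::queue<int> ready;          // IDs of runnable "threads"
    std::atomic<bool> tick{false};  // shared timer tick

    ready.push(42);                 // a single runnable thread

    std::vector<std::thread> cores;
    for (int core = 0; core < 4; ++core) {
        cores.emplace_back([&, core] {
            while (!tick.load()) {} // all cores spin until the tick fires
            std::lock_guard<std::mutex> g(queue_lock);
            if (!ready.empty()) {
                int t = ready.front();
                ready.pop();
                std::printf("core %d picked up thread %d\n", core, t);
            }
        });
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tick.store(true);               // the "timer interrupt" fires on all cores at once

    for (auto& c : cores) c.join();
}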
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
There's quite a big boost in performance when threads are split across CCXs once you chuck in high-speed RAM.

The first 2x CCX result is with 2133MHz RAM. The second is with 3200MHz. With the slower RAM, the draw call efficiency is better than Phenom II/Piledriver, but worse than Core 2. With the faster RAM, it performs at Core 2 levels. That's 600MHz (actual speed difference) faster RAM.

I'd really like to see how things fare if we were able to clock the DF past what our RAM is rated for.

That seems mostly a matter of DRAM latency and not CCX to CCX.
You don't specify the CAS latency for the DRAM used, but with 2133 CL14 you would get around 100ns, while with 3200 CL14 you would get into the low 70s.
The issue with CCX-to-CCX seems to be the data path, not bandwidth or latency. It doesn't go straight to the other CCX, and we don't quite know what it does and why.
Clocking the DF higher would reduce latency but the latency will still be high if the data always goes through a bunch of hops.
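
For a rough sense of scale, the CAS component alone is only a small slice of those end-to-end figures; a back-of-the-envelope sketch (the measured ~100 ns and ~70 ns numbers also include the memory controller and the data fabric, which runs at the memory clock):

Code:
// Back-of-the-envelope: CAS latency in nanoseconds for DDR4 at a given
// transfer rate. tCAS(ns) = CL / (MT/s / 2) * 1000 = 2000 * CL / MTs.
// This is only the column-access portion of the end-to-end latency.
#include <cstdio>

double cas_ns(int cl, int mts) { return 2000.0 * cl / mts; }

int main() {
    std::printf("DDR4-2133 CL14: %.1f ns\n", cas_ns(14, 2133));  // ~13.1 ns
    std::printf("DDR4-3200 CL14: %.1f ns\n", cas_ns(14, 3200));  // ~8.8 ns
}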
 
Reactions: Dresdenboy