Ryzen: Strictly technical


JimmiG

Platinum Member
Feb 24, 2005
2,024
112
106
This is why some people are seeing a sudden improvement in a few cases with the new Windows update - core parking was re-enabled and they didn't notice.

This. I got down-voted on r/AMD for pointing out that there's nothing in the KB article suggesting anything was changed with regards to the scheduler and Ryzen. People there are convinced it's some giant conspiracy and Microsoft and AMD are too afraid of stepping on each other's toes to mention any such improvements. So Microsoft implemented the Ryzen "fixes" in secret and didn't mention it in the KB article.

Also, I wonder if Ryzen would have performed better in the game benchmarks if reviewers had used Balanced mode with the appropriate tweaks (especially minimum CPU at 100% to prevent latency issues due to Windows trying to manage the clock speed) instead of High Performance. This would have resulted in higher clock speeds in lightly threaded applications due to XFR, and potentially better thread management.
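
If anyone wants to try that without switching plans, the minimum processor state on the Balanced plan can be raised from an elevated prompt; a minimal sketch that just shells out to powercfg (SCHEME_CURRENT, SUB_PROCESSOR and PROCTHROTTLEMIN are the documented aliases for the active plan, the processor subgroup and "minimum processor state"):

Code:
// Hedged sketch: apply the Balanced-plan tweak described above by shelling
// out to powercfg. The same commands can of course be run by hand from an
// elevated prompt, or set in the Power Options control panel.
#include <cstdlib>

int main() {
    // Set minimum processor state to 100% on AC and DC for the active plan.
    std::system("powercfg /setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMIN 100");
    std::system("powercfg /setdcvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMIN 100");
    // Re-apply the active scheme so the change takes effect immediately.
    std::system("powercfg /setactive SCHEME_CURRENT");
}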
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
This. I got down-voted on r/AMD for pointing out that there's nothing in the KB article suggesting anything was changed with regards to the scheduler and Ryzen.

Something in that giant KB4013429 changed my system. Bigly. I ran the Game DVR after updating, and it totally honked my computer. It recorded my replay just fine, just like it has recorded dozens of other replays in the past. But after I stopped the recording, things got really weird. There was a large delay for everything I clicked to highlight. In Windows Explorer, or on the desktop, when you click on a file the icon or filename highlights instantly; even on slow machines it is pretty much instant. But after my Game DVR session, every single icon click took 2-3 seconds for the icon to actually highlight. Yet Task Manager showed 99% idle. I've never seen anything like this. There was nothing running bogging my system down. Only a reboot fixed it.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
This. I got down-voted on r/AMD for pointing out that there's nothing in the KB article suggesting anything was changed with regards to the scheduler and Ryzen. People there are convinced it's some giant conspiracy and Microsoft and AMD are too afraid of stepping on each other's toes to mention any such improvements. So Microsoft implemented the Ryzen "fixes" in secret and didn't mention it in the KB article.

Also, I wonder if Ryzen would have performed better in the game benchmarks if reviewers had used Balanced mode with the appropriate tweaks (especially minimum CPU at 100% to prevent latency issues due to Windows trying to manage the clock speed) instead of High Performance. This would have resulted in higher clock speeds in lightly threaded applications due to XFR, and potentially better thread management.

On my system, 2 days after the first update, there was another little update that did not require reboot...
 

innociv

Member
Jun 7, 2011
54
20
76
UMS is exactly what I'm talking about.

It's been verified that many games see Ryzen as a 16 core. As in, they are not seeing SMT and are instructing the scheduler accordingly.
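
For what it's worth, the topology is queryable through the documented API; a minimal sketch (not from any particular game) that counts physical cores and logical processors with GetLogicalProcessorInformationEx - an 8C/16T Ryzen with SMT exposed should report 8 and 16:

Code:
// Sketch: count physical cores and logical processors via the documented
// GetLogicalProcessorInformationEx API. This is how software can tell an
// 8C/16T part from 16 real cores.
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &len);
    std::vector<char> buf(len);
    if (!GetLogicalProcessorInformationEx(
            RelationProcessorCore,
            reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data()),
            &len)) {
        return 1;
    }

    int cores = 0, logical = 0;
    for (DWORD off = 0; off < len;) {
        auto* info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data() + off);
        ++cores;
        // Count set bits in the core's group affinity mask(s) = logical CPUs on that core.
        for (WORD g = 0; g < info->Processor.GroupCount; ++g) {
            KAFFINITY m = info->Processor.GroupMask[g].Mask;
            while (m) { logical += static_cast<int>(m & 1); m >>= 1; }
        }
        off += info->Size;
    }
    std::printf("%d physical cores, %d logical processors\n", cores, logical);
}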
 

Mockingbird

Senior member
Feb 12, 2017
733
741
106
UMS is exactly what I'm talking about.

It's been verified that many games see Ryzen as a 16 core. As in, they are not seeing SMT and are instructing the scheduler accordingly.

That's not a big issue.

One can simply go into the BIOS and turn off SMT.
 

mtcn77

Member
Feb 25, 2017
105
22
91
Is there a consensus yet on enabling HPET?
Ryzen master needs HPET enabled.
HPET actually slows down the OS timer, checked via DPC Latency Checker v1.3.0. I'm using Hypermatrix's SetTimerResolution utility to alternate between the variables.
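
For anyone who wants to poke at this themselves, the multimedia timer API is the documented way to query and raise the system timer resolution; a minimal sketch (timer-resolution utilities typically use the native NT call instead, so this is only an approximation of what they do):

Code:
// Sketch: query the multimedia timer's supported range and temporarily raise
// the system timer resolution to 1 ms while doing latency-sensitive work.
#include <windows.h>
#include <timeapi.h>
#include <cstdio>
#pragma comment(lib, "winmm.lib")

int main() {
    TIMECAPS tc;
    if (timeGetDevCaps(&tc, sizeof(tc)) == MMSYSERR_NOERROR) {
        std::printf("timer period range: %u ms .. %u ms\n", tc.wPeriodMin, tc.wPeriodMax);
    }

    timeBeginPeriod(1);   // request 1 ms scheduling/timer granularity
    Sleep(2000);          // ... do latency-sensitive work here ...
    timeEndPeriod(1);     // always pair with timeEndPeriod
}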
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
Is there a consensus yet on enabling HPET?
Ryzen master needs HPET enabled.

Intuitively, I would leave HPET enabled. It fixes the "sleep clock bug" when overclocking if nothing else, and is a stable, reliable, high-resolution time source. I haven't heard of any serious problems being caused by it.
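
A quick, hedged way to see what is backing the high-resolution counter on a given box is to print the QueryPerformanceCounter frequency - on recent Windows it is commonly around 10 MHz when the invariant TSC is used and about 14.32 MHz when the platform clock (HPET) is forced with bcdedit /set useplatformclock true; treat those values as typical observations rather than a guarantee:

Code:
// Sketch: print the QueryPerformanceCounter frequency. The value hints at
// which hardware timer backs QPC on this particular system (see note above).
#include <windows.h>
#include <cstdio>

int main() {
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   // always succeeds on XP and later
    std::printf("QPC frequency: %lld Hz\n", static_cast<long long>(freq.QuadPart));
}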
 

ryzenmaster

Member
Mar 19, 2017
40
89
61
So the scheduler saga continues. This time I wanted to try something different and see just how bad the thread migration issue is. As it turns out, there is a way to programmatically find out which core your thread is running on (see https://msdn.microsoft.com/en-us/library/ms683181(v=vs.85).aspx). Calling it every iteration would be quite excessive, so instead I added a condition so that it only gets called when there was too big a difference in latency compared to the previous iteration. Maybe not perfect, but the idea is that a big difference between iterations could be explained by a context switch.
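
For reference, that check can be wired up roughly like this; a minimal sketch assuming the Win32 call from the MSDN link above (GetCurrentProcessorNumber) and a dummy spin loop standing in for the actual tree traversal:

Code:
// Minimal sketch: time each iteration of a dummy workload and, when the
// latency jumps well above the previous iteration, record which logical
// processor the thread is currently on. The visited-core bitmask mirrors
// the "Cores:" strings in the results below.
#include <windows.h>
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;
    std::uint64_t visited = 1ull << GetCurrentProcessorNumber();  // record the starting core
    volatile std::uint64_t sink = 0;   // keeps the loop from being optimized away
    long long prev_ns = 0;
    const auto deadline = clock::now() + std::chrono::seconds(60);

    while (clock::now() < deadline) {
        const auto t0 = clock::now();
        for (int i = 0; i < 10000; ++i) sink = sink + i;   // stand-in for the tree lookup
        const long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - t0).count();

        // Only query the processor number when the latency spikes versus the
        // previous iteration (here: more than 2x), as described above.
        if (prev_ns != 0 && ns > 2 * prev_ns)
            visited |= 1ull << GetCurrentProcessorNumber();
        prev_ns = ns;
    }

    std::printf("visited-core bitmask: 0x%016llx\n",
                static_cast<unsigned long long>(visited));
}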

I also changed from a fixed number of iterations to a timed run. Now each thread runs for 60 seconds and at the end reports its average latency as well as all the cores it visited during the run. To allow migration, each run was done without affinity. The first run was with 4 threads, each referencing one and the same tree instance.

Here are the results:

[Thread: 1] Avg: 495ns
[Thread: 1] Cores: 0000000000000001

[Thread: 2] Avg: 483ns
[Thread: 2] Cores: 1000010000001000

[Thread: 3] Avg: 483ns
[Thread: 3] Cores: 0010000000110000

[Thread: 4] Avg: 491ns
[Thread: 4] Cores: 0000100011000000

Cores are numbered 0-15 and a value of 1 means the core was visited by the thread, whereas 0 means it wasn't. So during the 60-second run every thread, with the exception of #1, got bounced between cores, and in fact between CCXs as well. One shortcoming is that I'm not collecting metrics on how long was spent on each core or how frequently migration occurs. The main point, however, is to verify that cross-CCX migration does happen.

Next up is a slightly different setup with 4 smaller 1MB instances of the tree structure containing a dictionary. Each thread this time had a reference to a different instance, so we have 1:1 cardinality. In this case it shouldn't make any difference which CCX the threads get assigned to, as long as they stay there.

Here are the results:

[Thread: 1] Avg: 231ns
[Thread: 1] Cores: 0000101000100000

[Thread: 2] Avg: 232ns
[Thread: 2] Cores: 1010000000000001

[Thread: 3] Avg: 230ns
[Thread: 3] Cores: 0100000010000000

[Thread: 4] Avg: 231ns
[Thread: 4] Cores: 0000000001100000

So again we can see that all but one thread were at some point scheduled on a different CCX. The overall lower latencies can be explained by the smaller data set and hence a smaller tree to traverse. Again, there's no clue here as to how frequently migration occurs; if it only happens once or so during the run, it would have very little effect on the averages over a 60-second run. Indeed, if we look at the same scenario with 0,2,4,6 affinity (one way to set that affinity is sketched after these results), we can see that there is pretty much no difference:

[Thread: 1] Avg: 228ns
[Thread: 1] Cores: 1000000000000000

[Thread: 2] Avg: 231ns
[Thread: 2] Cores: 0010000000000000

[Thread: 3] Avg: 225ns
[Thread: 3] Cores: 0000100000000000

[Thread: 4] Avg: 231ns
[Thread: 4] Cores: 0000001000000000
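
One way to pin the four workers to cores 0, 2, 4 and 6 - not necessarily how the benchmark above did it - is SetThreadAffinityMask, roughly like so:

Code:
// Hedged sketch: pin four worker threads to logical processors 0, 2, 4, 6.
// The actual benchmark may have used a different mechanism (e.g. start /affinity).
#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

void worker(int id, DWORD_PTR cpu) {
    // Restrict this thread to exactly one logical processor before doing work.
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << cpu);
    // ... the timed benchmark loop would go here ...
    std::printf("[Thread: %d] running on CPU %lu\n", id, GetCurrentProcessorNumber());
}

int main() {
    const DWORD_PTR cpus[4] = {0, 2, 4, 6};
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i + 1, cpus[i]);
    for (auto& t : threads) t.join();
}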

Finally, for the last scenario I had only one thread, which was assigned the larger 3MB data set. During these runs I had no other active software running aside from Task Manager, and all other cores were sitting idle with CPU usage at 1%. My benchmarking software runs a continuous loop without any blocking operations, locking, thread sleeping, I/O or any other operation which might encourage a context switch, so there should be little to no reason to interrupt my loop during the run. If any other operations need to be scheduled, they could be scheduled on any of the idle cores.

So let's see what happened over 3 runs, each with only one thread:

[Thread: 1] Avg: 311ns
[Thread: 1] Cores: 0000000100000010

[Thread: 1] Avg: 309ns
[Thread: 1] Cores: 0000000000010000

[Thread: 1] Avg: 312ns
[Thread: 1] Cores: 0001000000000010

In 2 of the 3 runs our thread not only got scheduled on different cores, it also got bounced between the two CCXs. Based on what I saw in Task Manager, I believe this migration happened only once during each run, so it should hardly even show up in the averages.

In conclusion, if there was still any doubt as to whether the Windows 10 scheduler needlessly bounces threads between CCXs, I can now say with a high degree of certainty: it does!

Q: "But wait a minute.. aren't your results suggesting that there is very little latency penalty in single threaded applications even if thread migration is real?"

A: Well, again, my scenario gives little to no reason for context switching to occur in the first place. Something like a single-threaded game is much more complex, so these results do not translate well to such scenarios. This is just one specific case, but it does suffice to substantiate that thread migration is real.

Also, a bonus factoid: over a 5-minute run with 4 threads, each thread visited every single physical and logical core at some point during the run.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
So, NewEgg and FedEx came through and got me an ASUS Crosshair VI Hero and the ASRock Fatalwhat? AB350 Gaming K4 delivered on the same day. I had a strange issue with the ASUS, which appears to have just been the mounting pressure on the CPU, since I went ahead and mounted it in my Enthoo Luxe case with the waterblock, so the pressure is set by thumbscrews.

This is a downside to those short pins - if you apply too much pressure and the board warps even a little, the pins make poor or no contact with the socket, as the socket itself will distort.

I will be running some experiments, including trying to get NUMA working - since these CPUs frankly need to be configured as two NUMA nodes - accessing data between the CCXes is the same as going back out to main memory... which makes it not much different than normal dual socket configurations.
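
A quick way to check whether Windows actually exposes more than one node after any BIOS/ACPI experiments is GetNumaHighestNodeNumber; a minimal sketch (stock Ryzen desktop platforms are expected to report a single node):

Code:
// Sketch: ask Windows how many NUMA nodes it sees. A hypothetical two-node
// (per-CCX) configuration would show up here.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highest = 0;
    if (GetNumaHighestNodeNumber(&highest)) {
        std::printf("NUMA nodes visible to Windows: %lu\n", highest + 1);
    }
}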

I am running stability testing and validating that benchmarks haven't changed, will post anything interesting as it arises.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
which makes it not much different than normal dual socket configurations.
Is it? Memory access is still uniform; it is more of a Skulltrail than a NUMA config, now that I consider it. In fact, it is literally a Skulltrail config, just on a single die: 2 quad-cores with plenty of cache for them, connected to each other, the memory and all the I/O by a bus.
 

powerrush

Junior Member
Aug 18, 2016
20
4
41
Is it? Memory access is still uniform; it is more of a Skulltrail than a NUMA config, now that I consider it. In fact, it is literally a Skulltrail config, just on a single die: 2 quad-cores with plenty of cache for them, connected to each other, the memory and all the I/O by a bus.
A bus that is clocked at the speed of ram... it is a fail
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,864
3,417
136
I wonder why it's not possible to overclock it, independently of memory.
I wonder why people think it needs to be faster? There is a limit to the bandwidth of the interface between the ccx and the rest of the soc. There is also no indication that exclusive lines in the l3 can't just be copied across.
 

powerrush

Junior Member
Aug 18, 2016
20
4
41
I wonder why people think it needs to be faster? There is a limit to the bandwidth of the interface between the ccx and the rest of the soc. There is also no indication that exclusive lines in the l3 can't just be copied across.

The interface is the Coherent Data Fabric. L3 bandwidth is many GB/s above DDR4 bandwidth, so if I don't want a bottleneck between the cores, I must consider a unified L3 cache across the 8 cores or an accelerated bus between the core complexes.
 
Reactions: Malogeek

powerrush

Junior Member
Aug 18, 2016
20
4
41
Look, Ryzen is a beast of a CPU; the IPC must be better than Intel's, but it is dropping many cycles per second. That is the reason why in heavy multithreaded tasks it performs like an i7-6900K. In heavy multitasking the threads don't cycle between core complexes, or I should say they don't cross the laggy zone.
 
Reactions: looncraz

itsmydamnation

Platinum Member
Feb 6, 2011
2,864
3,417
136
The interface is the Coherent Data Fabric. L3 bandwidth is many GB/s above DDR4 bandwidth, so if I don't want a bottleneck between the cores, I must consider a unified L3 cache across the 8 cores or an accelerated bus between the core complexes.
You are presenting guesses as fact. What clock rate does the CCX-to-fabric interface run at?

Why are you talking about a unified cache? That has nothing to do with Zen, or with how CCXs are used to create scalable solutions.

The first problem is that no one here knows how the cache coherency protocol in Zen works, or any directories/snoop filters, or how they handle any of the states outside of the exclusive state.

So you just run to bandwidth, because that's the easy thing to blame...
 
Reactions: Ajay

MajinCry

Platinum Member
Jul 28, 2015
2,495
571
136
I wonder why people think it needs to be faster? There is a limit to the bandwidth of the interface between the ccx and the rest of the soc. There is also no indication that exclusive lines in the l3 can't just be copied across.

There's quite a big boost in performance when threads are split across CCXs once you chuck in high-speed RAM.

The first 2x CCX result is with 2133MHz RAM. The second is with 3200MHz. With the slower RAM, the draw call efficiency is better than Phenom II/Piledriver, but worse than Core 2. With the faster RAM, it performs at Core 2 levels. That's 600MHz (actual speed difference) faster RAM.

I'd really like to see how things fare if we were able to clock the DF past what our RAM is rated for.
 

powerrush

Junior Member
Aug 18, 2016
20
4
41
You are presenting guesses as fact. What clock rate does the CCX-to-fabric interface run at?

Why are you talking about a unified cache? That has nothing to do with Zen, or with how CCXs are used to create scalable solutions.

The first problem is that no one here knows how the cache coherency protocol in Zen works, or any directories/snoop filters, or how they handle any of the states outside of the exclusive state.

So you just run to bandwidth, because that's the easy thing to blame...

Maybe you can explain better why Ryzen is lagging behind Intel's IPC in lightly threaded tasks? I'll wait...
 

hamunaptra

Senior member
May 24, 2005
929
0
71
In conclusion, if there was still any doubt as to whether the Windows 10 scheduler needlessly bounces threads between CCXs, I can now say with a high degree of certainty: it does!

Also, a bonus factoid: over a 5-minute run with 4 threads, each thread visited every single physical and logical core at some point during the run.


I'm curious, wouldn't the quantum timing of threads affect threads bouncing around?
I know there are thousands of events (things the scheduler handles and dispatches) going on in any given second, such as interrupt handling, etc.
Maybe the scheduler is scheduling them in such a manner that these events have higher priority and is spreading them across all cores, thus preempting your test threads' quantums and causing the resultant context switch of the thread to another core for its next quantum?
I think even equal-priority threads are allowed to run their full quantum, but then the next thread of the same priority runs its quantum... each causing a context switch and a potential reschedule on a different core.
That's my guess as to why threads get bounced around so much =-)
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
In conclusion, if there was still any doubt as to whether the Windows 10 scheduler needlessly bounces threads between CCXs, I can now say with a high degree of certainty: it does!
Is this result with Core Parking on? If so, how does it change with it off? (Or vice versa.)
 

Kromaatikse

Member
Mar 4, 2017
83
169
56
I'm curious, wouldn't the quantum timing of threads affect threads bouncing around?
I know there are thousands of events (things the scheduler handles and dispatches) going on in any given second, such as interrupt handling, etc.
Maybe the scheduler is scheduling them in such a manner that these events have higher priority and is spreading them across all cores, thus preempting your test threads' quantums and causing the resultant context switch of the thread to another core for its next quantum?
I think even equal-priority threads are allowed to run their full quantum, but then the next thread of the same priority runs its quantum... each causing a context switch and a potential reschedule on a different core.
That's my guess as to why threads get bounced around so much =-)

I have a fair idea of what's going on in the scheduler itself.

What you have to understand is that there is a single data structure across all CPUs (maybe all CPUs in the same group, in a multi-group setup), known as the "multilevel priority queue". The scheduler algorithm itself operates on all cores individually, communicating only via this single data structure. The core-parking mask is largely independent of this algorithm, and is maintained by the power-management subsystem which, I believe, operates on its own process or thread or interrupt (or something).

When a thread's timeslice expires, a timer interrupt fires and causes the thread to be suspended in favour of the scheduler algorithm. This in turn adds the thread to the appropriate level of the queue. The scheduler then looks at the queue to see which thread it should run next, which might be the same thread or a different one.

The twist is that the timer interrupt fires on *all* unparked cores simultaneously, so the scheduler algorithm is running *concurrently* on all those CPU cores. There is a mutex (or, equivalently, a lock-free algorithm) to prevent this from corrupting the queue, but the net result is a race condition with respect to exactly which core gets to pick up which thread. Especially if there is only one thread, the exact core that picks it up is random within the unparked set.

In short, the Windows scheduler is non-deterministic. This is a general property of concurrent algorithms containing race conditions.

In theory, there is a straightforward fix - or rather, two equivalently good alternative fixes which are both straightforward. Microsoft would have absolutely no trouble implementing either or both of them if it wanted to.
  • Have the scheduler timer interrupts fire in a staggered fashion, so that the algorithm (usually) does not run concurrently. This greatly increases the chance that the thread will stay on its original core when appropriate.
  • Hold the queue mutex continuously between inserting the old thread into the queue and extracting the new one. This guarantees that a single thread will remain on its original core when appropriate.
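
To make the race concrete, here is a toy model in plain C++ (not Windows internals): several "cores" wake on the same simulated tick and contend for a single ready thread through a shared, locked queue, and which core wins differs from run to run. Holding the lock across both the re-insert and the pick, as in the second fix above, is what would make the placement deterministic again.

Code:
// Toy illustration only: four "cores" wake on the same simulated timer tick
// and race to pull the single ready thread off a shared queue. Which core
// picks it up varies between runs, mimicking the non-deterministic placement
// described above.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    std::mutex queue_lock;
    std::queue<int> ready;          // IDs of runnable "threads"
    std::atomic<bool> tick{false};  // shared timer tick

    ready.push(42);                 // a single runnable thread

    std::vector<std::thread> cores;
    for (int core = 0; core < 4; ++core) {
        cores.emplace_back([&, core] {
            while (!tick.load()) {} // all cores spin until the tick fires
            std::lock_guard<std::mutex> g(queue_lock);
            if (!ready.empty()) {
                int t = ready.front();
                ready.pop();
                std::printf("core %d picked up thread %d\n", core, t);
            }
        });
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tick.store(true);               // the "timer interrupt" fires on all cores at once

    for (auto& c : cores) c.join();
}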
 

imported_jjj

Senior member
Feb 14, 2009
660
430
136
There's quite a big boost in performance when threads are split across CCXs once you chuck in high-speed RAM.

The first 2x CCX result is with 2133MHz RAM. The second is with 3200MHz. With the slower RAM, the draw call efficiency is better than Phenom II/Piledriver, but worse than Core 2. With the faster RAM, it performs at Core 2 levels. That's 600MHz (actual speed difference) faster RAM.

I'd really like to see how things fare if we were able to clock the DF past what our RAM is rated for.

That seems mostly a matter of DRAM latency and not CCX to CCX.
You don't specify the CAS latency for the DRAM used, but with 2133 CL14 you would get around 100ns, while with 3200 CL14 you would get into the low 70s.
The issue with CCX-to-CCX seems to be the data path, not bandwidth or latency. It doesn't go straight to the other CCX, and we don't quite know what it does and why.
Clocking the DF higher would reduce latency but the latency will still be high if the data always goes through a bunch of hops.
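
For a rough sense of scale, the CAS component alone is only a small slice of those end-to-end figures; a back-of-the-envelope sketch (the measured ~100 ns and ~70 ns numbers also include the memory controller and the data fabric, which runs at the memory clock):

Code:
// Back-of-the-envelope: CAS latency in nanoseconds for DDR4 at a given
// transfer rate. tCAS(ns) = CL / (MT/s / 2) * 1000 = 2000 * CL / MTs.
// This is only the column-access portion of the end-to-end latency.
#include <cstdio>

double cas_ns(int cl, int mts) { return 2000.0 * cl / mts; }

int main() {
    std::printf("DDR4-2133 CL14: %.1f ns\n", cas_ns(14, 2133));  // ~13.1 ns
    std::printf("DDR4-3200 CL14: %.1f ns\n", cas_ns(14, 3200));  // ~8.8 ns
}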
 
Reactions: Dresdenboy