Ryzen: Strictly technical

looncraz · Mar 16, 2017

Kromaatikse said:
I found this:

Thread Ideal Processor
When you specify a thread ideal processor, the scheduler runs the thread on the specified processor when possible. Use the SetThreadIdealProcessor function to specify a preferred processor for a thread. This does not guarantee that the ideal processor will be chosen but provides a useful hint to the scheduler.

So, a game which is aware of Ryzen's special topology can influence Windows' scheduling behaviour without relying on an affinity mask. This is fortunate since the latter appears to be broken. The problem is that each and every game dev needs to think about and set this correctly.

I remain convinced that the scheduler itself is completely oblivious to SMT, NUMA, etc. Any illusion otherwise is given by the core-parking algorithm (which is at least SMT aware), and by affinity optimisations applied internally or externally to a given process. The documentation linked above talks a lot about applications needing to take responsibility for optimising their own affinity settings for topological considerations.

This wouldn't really help in any case for Ryzen, though. You need to set affinity to a CCX when more than one CCX is involved.

You can set 1111111100000000 for one group of related threads and 0000000011111111 for another group of related threads. Relations, in this case, are all about what memory is accessed most frequently.

CCX 0: Primary game loop thread, networking, sound processing
CCX 1: Graphics queue ordering/submission thread, player AI threads, building/object AI threads

The exact ideal distribution will depend on the game.

itsmydamnation · Mar 16, 2017

innociv said:
Jaguar is based on Bobcat which is based on K10.
Jaguar has more in common with K8 and Ryzen as well than Bulldozer.

Ryzen is sort of Phenom IV... if Phenom III ever existed. But you could think of Bobcat and Jaguar as Phenom III.

This here isn't even close to right, bobcat was not in anyway based on K10 ( go look at the uarches FFS!)
I would also say Ryzen has far more in common with Bulldozer then K8, things that look more like bulldozer then K8 in Zen:

Registers ( both I and FP )
Instruction decode/predict/fetch
retirement
ALU/AGU arrangement
load store configuration

things that look like K8 then BD in Zen:
...... nothing

You have to understand that STARS was fundamentally at a dead end in many places, Bulldozer brought many of its mirco architectural functions upto or ahead of sandy bridge. But the target of the uarch either never happened (SpMT) or was just plan wrong (CMT).

tamz_msc said:
I don't see how a 16C32T chip solves the inter-CCX data fabric latency issue from a purely HW design standpoint, unless it's a monolithic block like Intel. Unless they aren't using the 4-core CCX blocks?

How does a silicon revision solve what is fundamentally an interconnectivity issue?

While latency wont improve, throughput most certainly would, (not that i think we are throughput bound by the fabric in Zepplin) if we assume a full mesh between all the IMC's.

CatMerc · Mar 16, 2017

lolfail9001 said:
I believe there was mentioned a presently disabled ability to run DF at 2x the mem bus clock (so basically to run it at the tick rate of memory). Should AMD be able use it with some new silicon revision, the DF issues for all intents and purposes may be resolved.

It will solve bandwidth woes, but latency...

OrangeKhrush said:
Yesterday I posted from my source talk of a new platform, today Chiphell leaked the X399 which my source confirmed is real.

I did confirm that the new silicon rivisions with "ironed" out issues is all the way from these behemoths to the entry level SKU's, I am not really getting my knickers in abunch about this anymore, performance is coming but only if you haven't bought a chip yet.

So the question is, if AMD will bring them out in a revision (AKA 1850x or 2800x) or will just mix them in with the existing chips and have it be a matter of luck lol

Ajay · Mar 16, 2017

Kromaatikse said:
Well, this is of course the danger Microsoft inherently runs by punting topology awareness to applications. In general, application developers aren't as aware of these details as system devs, and certainly the spread of knowledge is more uneven.

Yeah, this seems to be the case and may explain why so many games have historically had poor multicore support. Developing effective threaded code is hard enough, if one needs high performance one needs to really understand how to optimize for a given topology and actively manage it through the scheduler APIs. This is probably and advantage for IBM, Oracle or SAP, but no so for smaller dev teams with limited resources (so, many probably don't manage it and leave it to the scheduler's default behavior).

A Multi Level Feedback Queue is designed to solve this problem:

How can we design a scheduler that both minimizes response time for interactive jobs while also minimizing turnaround time without a priori knowledge of job length?

When MLFQ was developed (1960s), it was probably designed to work on single core CPUs and happened to work well with some modifications on monolithic multicore systems. Many modern CPU features (SMT, P-states) are handled with Windows Processors Power Management system: https://msdn.microsoft.com/en-us/library/windows/hardware/mt422910(v=vs.85).aspx

I will look up the API later - but it's no wonder that game devs prefer consoles with fixed hardware over a user configurable desktop OS. API: https://msdn.microsoft.com/en-us/library/aa373170(v=vs.85).aspx

dnavas · Mar 16, 2017

CatMerc said:
It will solve bandwidth woes, but latency...

...may or may not be related. TBD. Is memory latency slow because there's a prior ping across L3s? Is it slow because sync'ing between the two clock domains was causing problems? It depends on the nature of the hardware problem and the bios workarounds used to mitigate them....

lolfail9001 · Mar 16, 2017

CatMerc said:
It will solve bandwidth woes, but latency...

Well, if they can keep it at same cycle count, then it would bring latency PCPer measured inline with Haswell-E.

CatMerc · Mar 16, 2017

lolfail9001 said:
Well, if they can keep it at same cycle count, then it would bring latency PCPer measured inline with Haswell-E.

Huh...
Forgot that it doesn't need to do 40ns like intra-CCX communication, but just 80ns to match Intel's communication in worst case. In that case it should work just fine. Heck it might even end up being slightly faster than Intel because of the intra CCX communication being faster.

Dresdenboy · Mar 16, 2017

looncraz said:
I tested with two memory configurations: DDR4-2133 CL15-15-15-35 2T and DDR4-2667 CL15-15-15-35-1T.

I don't have the results on this machine, but Ryzen is very sensitive to memory performance - to the point that it even impacts benchmarks that are usually unaffected by memory clocks. That, of course, is because the data fabric runs at the same speed as the IMC... a strange, and somewhat infuriating, choice.

What's strange is how core clock impacts memory latency and bandwidth. From 3Ghz to 3.8GHz on the CPU should see no more than 1~3% difference. I go from ~35GB/s to >43GB/s using DDR4-2667. Latency drops nearly 10ns (~10%).

The guys at P3DNow! observed something similar. Interestingly I calculate the same delta between 2133 and 2667.

looncraz said:
Kind of seems like the requested data is sent directly to the L3, not the requesting core, and is then piped into the core... because L3 latencies drop with higher frequencies - as you'd expect.

That could also explain the high memory latencies...

IMC -> DF -> CCX DF -> L3 -> Core L2 -> Core Fetch

I'm writing an app to test this and trying to isolate the data fabric, caches, and IMC latencies.

Data shouldn't land in L3 first, as these are victim caches.

unseenmorbidity · Mar 16, 2017

OrangeKhrush said:
Yesterday I posted from my source talk of a new platform, today Chiphell leaked the X399 which my source confirmed is real.

I did confirm that the new silicon rivisions with "ironed" out issues is all the way from these behemoths to the entry level SKU's, I am not really getting my knickers in abunch about this anymore, performance is coming but only if you haven't bought a chip yet.

Does that mean there will be a 1750 released in the next few months?

innociv · Mar 16, 2017

CatMerc said:
Forgot that it doesn't need to do 40ns like intra-CCX communication, but just 80ns to match Intel's communication in worst case. In that case it should work just fine.

I don't think it's that simple. I think in a real application the MMU is going to get overloaded with so many cache misses happening as threads move back and forth from one CCX to another that it ends up affecting it far more than just another 80ns wait here and there over Broadwell.

Also consider how that 140ns latency of going CCX-to-CCX is slower than the 80ns or so it takes to fetch from DDR.

Kromaatikse · Mar 16, 2017

There's a good chance that inter-CCX latency is being affected by bandwidth saturation. That is, requests end up in a queue which takes time to clear. With more bandwidth draining that queue, it'll not only take less time to clear but will spend less time with queues full in the first place.

Shivansps · Mar 16, 2017

looncraz said:
This wouldn't really help in any case for Ryzen, though. You need to set affinity to a CCX when more than one CCX is involved.

You can set 1111111100000000 for one group of related threads and 0000000011111111 for another group of related threads. Relations, in this case, are all about what memory is accessed most frequently.

CCX 0: Primary game loop thread, networking, sound processing
CCX 1: Graphics queue ordering/submission thread, player AI threads, building/object AI threads

The exact ideal distribution will depend on the game.

What?! you want to separate the primary game thread from the "subsystems"?! You are also missing physics, that is not recomended to separate from the AI either.
Is not recomended to separate anything and its not recomended to try to do the scheduler job either.
Also the graphics API part will use multiple threads, specially if its Vulkan/DX12. Not sure if its a good idea to separate those either, Dota 2 on Linux runs way better under OpenGL than Vulkan on Ryzen, especially if more than 1 CCX is enabled, Vulkan on Dota 2 loves X+0.

innociv · Mar 16, 2017

Kromaatikse said:
There's a good chance that inter-CCX latency is being affected by bandwidth saturation. That is, requests end up in a queue which takes time to clear. With more bandwidth draining that queue, it'll not only take less time to clear but will spend less time with queues full in the first place.

I don't know... isn't it like 50Gb/s? That's quite a lot.
Is it really saturating 50Gb/s out of 2 8Mb caches?
I think it must be something else.

looncraz · Mar 17, 2017

Shivansps said:
What?! you want to separate the primary game thread from the "subsystems"?! You are also missing physics, that is not recomended to separate from the AI either.
Is not recomended to separate anything and its not recomended to try to do the scheduler job either.
Also the graphics API part will use multiple threads, specially if its Vulkan/DX12. Not sure if its a good idea to separate those either, Dota 2 on Linux runs way better under OpenGL than Vulkan on Ryzen, especially if more than 1 CCX is enabled, Vulkan on Dota 2 loves X+0.

It all depends on what you're doing in a given game thread. My main game threads are basically just timers and gate-keepers. They are loaded with volatile accesses, mutexes, and so on that almost always required going back to main memory for coordination. Granted, benaphores will suffer, but there's always going to be a price to pay... and if I'm going to pay it anywhere, it's going to be there - 100ns penalty on the timing code isn't such a big deal as a 100ns penalty repeated 50,000 times during frame generation.

I really wasn't trying to be all inclusive, just giving a generic idea of what should be done to optimize games specifically for the CCX issue - and that's keeping threads together which most heavily work on the same data at the same time.

A kernel scheduler can NOT know about cache locality. A CPU can, but it lacks the data it needs. If the CPU could take over scheduling and load balancing by presenting a false mapping and redirecting instructions on-the-fly, that'd be the ultimate in perfection. Kernels could then just use a simple round-robin scheduler and call it a day.

looncraz · Mar 17, 2017

Kromaatikse said:
Well, this is of course the danger Microsoft inherently runs by punting topology awareness to applications. In general, application developers aren't as aware of these details as system devs, and certainly the spread of knowledge is more uneven.

This really isn't a Microsoft issue. This is an evolutionary issue - multi-core CPUs weren't really a thing until 2005. Microsoft added GetLogicalProcessorInformation (a terrible API, but useful) in XP SP3 as well as improving threading throughout its own products. They couldn't really do much more.

There's now real way for a kernel scheduler to fully accommodate Ryzen's design without the application being involved or without creating per-application profiles... it can help out, but it will never do better than an application developer's own optimizations.

ndtech · Mar 17, 2017

looncraz said:
This wouldn't really help in any case for Ryzen, though. You need to set affinity to a CCX when more than one CCX is involved.

What about WinRAR tests with single-channel RAM with CCX affinity?
That test is very interesting thing for Ryzen.

OrangeKhrush · Mar 17, 2017

CatMerc said:
It will solve bandwidth woes, but latency...

So the question is, if AMD will bring them out in a revision (AKA 1850x or 2800x) or will just mix them in with the existing chips and have it be a matter of luck lol

It is a completely new line of CPU's with there own socked, if you read the die is big so it will not work on existing boards and existing chips will not work on that socket, it is a completely new socket.

unseenmorbidity said:
Does that mean there will be a 1750 released in the next few months?

As above new socket and I assume they will use independent monikers for it.

itsmydamnation · Mar 17, 2017

OrangeKhrush said:
It is a completely new line of CPU's with there own socked, if you read the die is big so it will not work on existing boards and existing chips will not work on that socket, it is a completely new socket.

The die is exactly the same, there are just two of them on a organic interposer.... But yes it will be the SP4 socket not AM4. TDP will be interesting, ~130Watt should get 3.0ghz base, ~180Watt 3.6ghz base. I was thinking of going a 1700 to replace my current esxi hosts, Depending on pricing something like a 12 core non X could be very interesting @550-600 USD.

edit: it would prob be more 600-700USD but one can dream .

lolfail9001 · Mar 17, 2017

innociv said:
I don't know... isn't it like 50Gb/s? That's quite a lot.

Presently it is more like 22Gb/s at 2666Mhz memory.

OrangeKhrush said:
As above new socket and I assume they will use independent monikers for it.

He is asking about Zeppelin revision that you claim will trickle down to existing line-up. As such, will the r7 line-up be refreshed?

itsmydamnation · Mar 17, 2017

lolfail9001 said:
Presently it is more like 22Gb/s at 2666Mhz memory.

if its 256bits a cycle then peak @ 1333mhz would be more like 32Gb/s.

edit: to get 22Gb/s memory has to be running @ 860mhz (1720mhz) which is just odd.

CatMerc · Mar 17, 2017

lolfail9001 said:
He is asking about Zeppelin revision that you claim will trickle down to existing line-up. As such, will the r7 line-up be refreshed?

Yeah, that.

OrangeKhrush · Mar 17, 2017

lolfail9001 said:
Presently it is more like 22Gb/s at 2666Mhz memory.

He is asking about Zeppelin revision that you claim will trickle down to existing line-up. As such, will the r7 line-up be refreshed?

That I will find out, I haven't heard of new SKU titles, just that later chips will have new steppings and behave differently.

Chl Pixo · Mar 17, 2017

lolfail9001 said:
Presently it is more like 22GB/s at 2666Mhz memory.

Isn`t that too low? Could you post the source of this information?
The speed you mentioned would be around the speed of single-channel DDR4-2666.
According the slide with clock domain the DF is 256b wide while DDR4 is 64b wide.
On 1333 MHz clock DF theoretical speed is 42.6GB/s.

If the speed was really that abysmal using DDR4-1600 would put the DF at 12.8GB/s.
Thats almost 1/2 speed of the x24 PCIe lines the CPU has.
I dont think AMD is stupid enough to do something that stupid.

Comparing dual-channel speed to single-channel would show where the bottleneck is.
If it is DF speed would be the same.

CatMerc · Mar 17, 2017

OrangeKhrush said:
That I will find out, I haven't heard of new SKU titles, just that later chips will have new steppings and behave differently.

And most likely still be on AM4.
No reason to upstart an entire long term platform only to replace it within a year with another more expensive one, if the previous platform is capable of supporting what most users need.

The 16 core Ryzen will need a new platform, but an 8 core revision should be just fine on AM4.

OrangeKhrush · Mar 17, 2017

CatMerc said:
And most likely still be on AM4.
No reason to upstart an entire long term platform only to replace it within a year with another more expensive one, if the previous platform is capable of supporting what most users need.

The 16 core Ryzen will need a new platform, but an 8 core revision should be just fine on AM4.

It is designed to be AMD's equivalent platform to Intel's X99 and soon to be X299 LGA2011 sockets, even confirmed that it is going to be LGA instead of AMD's typical PGA due to the physical size.

Ryzen: Strictly technical

Senior member

Diamond Member

Golden Member

Lifer

Senior member

Golden Member

Golden Member

Golden Member

Golden Member

Member

Member

Diamond Member

Member

Senior member

Senior member

Junior Member

Senior member

Diamond Member

Golden Member

Diamond Member

Golden Member

Senior member

Junior Member

Golden Member

Senior member