Ryzen: Strictly technical

Status
Not open for further replies.

Timur Born

Senior member
Feb 14, 2016
277
139
116
Is it expected that this only leads to minor performance differences then (2x CCX being slightly faster)?
 

Timur Born

Senior member
Feb 14, 2016
277
139
116
Curious observation: Setting the minimum processor state to 50% or lower in the AMD Ryzen power profile results in a x19.38 minimum multiplier. Doing the same in the Power Saver profile results in the usual x22 minimum multiplier. Any ideas why?
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Yes, via P-State settings. This currently seems to be bugged in the AMD CBS BIOS settings (if available at all), but ASUS ZenStates for Windows or ZenStates-Linux may work for you.

Only two P-states are available to software: P-State 1 and P-State 2.

P-State 0 is a hardware-mediated turbo that won't be engaged (but will determine the TSC frequency), and 'higher' P-states are for low-power modes.
 

Timur Born

Senior member
Feb 14, 2016
277
139
116
Curious observation: Setting the minimum processor state to 50% or lower in the AMD Ryzen power profile results in a x19.38 minimum multiplier. Doing the same in the Power Saver profile results in the usual x22 minimum multiplier. Any ideas why?
I even saw x18.5 now, and I still wonder why this only happens with AMD's power profile.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
Curious observation: Setting the minimum processor state to 50% or lower in the AMD Ryzen power profile results in a x19.38 minimum multiplier. Doing the same in the Power Saver profile results in the usual x22 minimum multiplier. Any ideas why?

It is probably using two different power/frequency scaling curves, and one is less accurate or less optimized than the other. I can't find any documentation on it, though.
 
May 11, 2008
20,041
1,289
126
Here is some very interesting L3 cache behavior of Ryzen CPUs being discussed. This cache behavior was discovered by hardware.fr:

https://forum.beyond3d.com/threads/amd-ryzen-cpu-architecture-for-2017.56187/page-88

edit:
Forgot to add my text:
Am I wrong about this, or maybe it is obvious? The L3 cache is a victim cache only for the CCX it is directly connected to. The L3 of CCX A has no purpose for CCX B, and vice versa.
 
Last edited:

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Here is some very interesting L3 cache behavior of Ryzen CPUs being discussed. This cache behavior was discovered by hardware.fr:

https://forum.beyond3d.com/threads/amd-ryzen-cpu-architecture-for-2017.56187/page-88

edit:
Forgot to add my text:
Am I wrong about this, or maybe it is obvious? The L3 cache is a victim cache only for the CCX it is directly connected to. The L3 of CCX A has no purpose for CCX B, and vice versa.

The behavior is exactly as expected for a bifurcated victim cache (or dual independent victim caches).

The Ryzen 5 1400 has 8MB of L3 in total, in two 4MB chunks. A 4MB allocation will not end up entirely in the L3: some will be in the L2, some in the L3, and some will not be on-die at all but still out in main memory, as other program data, kernel overhead, and context switching will easily cause lines to be evicted from the L3.

For the 16MB L3 CPUs, any single chunk over roughly 6.5~7.5MB will have the same problem. Cache-aware algorithms need to treat Ryzen as if it has 45% of the reported L3. I've made adjustments to some of my code to that effect and noticed 30%+ improvements in performance - so now Ryzen is showing me a few cases where it is 40% faster than Sandy Bridge per clock with my own hand-rolled code (very specific cases running integer and floating point operations in close proximity on a set of chunked data).
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
The behavior is exactly as expected for a bifurcated victim cache (or dual independent victim caches).

The Ryzen 5 1400 has 8MB of L3 in total, in two 4MB chunks. A 4MB allocation will not end up entirely in the L3: some will be in the L2, some in the L3, and some will not be on-die at all but still out in main memory, as other program data, kernel overhead, and context switching will easily cause lines to be evicted from the L3.

For the 16MB L3 CPUs, any single chunk over roughly 6.5~7.5MB will have the same problem. Cache-aware algorithms need to treat Ryzen as if it has 45% of the reported L3. I've made adjustments to some of my code to that effect and noticed 30%+ improvements in performance - so now Ryzen is showing me a few cases where it is 40% faster than Sandy Bridge per clock with my own hand-rolled code (very specific cases running integer and floating point operations in close proximity on a set of chunked data).
I sense something similar happening with the old CPU-Z benchmark, where certain sequences of integer instructions were unexpectedly running much faster.
 
Last edited:

ryzenmaster

Member
Mar 19, 2017
40
89
61
For the 16MB L3 CPUs, any single chunk over roughly 6.5~7.5MB will have the same problem. Cache-aware algorithms need to treat Ryzen as if it has 45% of the reported L3. I've made adjustments to some of my code to that effect and noticed 30%+ improvements in performance - so now Ryzen is showing me a few cases where it is 40% faster than Sandy Bridge per clock with my own hand-rolled code (very specific cases running integer and floating point operations in close proximity on a set of chunked data).

This. It may require a bit more work, but optimizing for Ryzen does require the developer to be aware that one ought not treat Ryzen's L3 as one coherent 16MB. Basically this means either staying below 8MB or splitting data sets into multiple chunks, and possibly even using affinity masks, as I've seen them make a difference in some scenarios.

Don't feel discouraged, though. Game developers are optimizing with <= 8MB L3 in mind anyway, since that's what you'd mostly expect from Intel's consumer products, such as the i7-7700K with its 8MB of L3.
 
May 11, 2008
20,041
1,289
126
The behavior is exactly as expected for a bifurcated victim cache (or dual independent victim caches).

The Ryzen 5 1400 has 8MB of L3 in total, in two 4MB chunks. A 4MB allocation will not end up entirely in the L3: some will be in the L2, some in the L3, and some will not be on-die at all but still out in main memory, as other program data, kernel overhead, and context switching will easily cause lines to be evicted from the L3.

For the 16MB L3 CPUs, any single chunk over roughly 6.5~7.5MB will have the same problem. Cache-aware algorithms need to treat Ryzen as if it has 45% of the reported L3. I've made adjustments to some of my code to that effect and noticed 30%+ improvements in performance - so now Ryzen is showing me a few cases where it is 40% faster than Sandy Bridge per clock with my own hand-rolled code (very specific cases running integer and floating point operations in close proximity on a set of chunked data).

Thanks for the reply. The other interesting part of the test is the behavior of the 1500X and the 1600.
Both have 2x 8MB in a 4/8 and a 6/12 configuration, yet at 6MB and 8MB there is a latency increase, while the 1600X, which is also 6/12, shows none. Is this a measurement error, or is the internal cache configuration (or the way the cores can access it) in the 1500X and 1600 different from the 1600X?
Why is the 1600X performing as if it has the full 2x 8MB?
Only after 8MB does the latency jump to around 80ns, in line with system memory access times.
 

Space Tyrant

Member
Feb 14, 2017
149
115
116
Thanks for the reply. The other interesting part of the test is the behavior of the 1500X and the 1600.
Both have 2x 8MB in a 4/8 and a 6/12 configuration, yet at 6MB and 8MB there is a latency increase, while the 1600X, which is also 6/12, shows none. Is this a measurement error, or is the internal cache configuration (or the way the cores can access it) in the 1500X and 1600 different from the 1600X?
Why is the 1600X performing as if it has the full 2x 8MB?
Only after 8MB does the latency jump to around 80ns, in line with system memory access times.

I would think that the difference between the 1600 and 1600X is explained by the lower average latency of the faster-clocked L1/L2/L3 caches of the 1600X.

Meanwhile, the 1500X has two fewer L1 and L2 caches, making it more dependent on the L3 cache (and its higher latency) for fetches that would have been an L2 hit on the six-core parts.
 
May 11, 2008
20,041
1,289
126
I would think that the difference between the 1600 and 1600X is explained by the lower average latency of the faster-clocked L1/L2/L3 caches of the 1600X.

Meanwhile, the 1500X has two fewer L1 and L2 caches, making it more dependent on the L3 cache (and its higher latency) for fetches that would have been an L2 hit on the six-core parts.

Clock speed was my first guess, but 2x and 3x higher latency at 6 and 8MB for the 1600 and 1500X compared to the 1600X does not fit that idea. But when googling:

https://www.techpowerup.com/231268/...yzed-improvements-improveable-ccx-compromises
Here the same effect shows up for the 1800X as for the 1500X and 1600.

Notice how the 1600X does not seem to have a similar latency increase at 6MB and 8MB? Must be a measurement error.
 

zir_blazer

Golden Member
Jun 6, 2013
1,184
459
136
Another interesting CPU-side bug was found on Ryzen:
http://www.os2museum.com/wp/vme-broken-on-amd-ryzen/

VME stands for Virtual-8086 Mode Enhancements. It is not related to modern virtualization; instead, it is related to the Virtual 8086 mode introduced with the 386 for running 16-bit code in a 32-bit OS. The crashes are most easily triggered in a Windows XP VM, but they can happen natively in FreeDOS, too.
 

Malogeek

Golden Member
Mar 5, 2017
1,390
778
136
Another interesting CPU-side bug was found on Ryzen:
http://www.os2museum.com/wp/vme-broken-on-amd-ryzen/

VME stands for Virtual-8086 Mode Enhancements. It is not related to modern virtualization; instead, it is related to the Virtual 8086 mode introduced with the 386 for running 16-bit code in a 32-bit OS. The crashes are most easily triggered in a Windows XP VM, but they can happen natively in FreeDOS, too.
People here were reporting broken 16-bit execution in VMs on Ryzen soon after it was released.
 
Reactions: Drazick

zir_blazer

Golden Member
Jun 6, 2013
1,184
459
136
Missed it completely. I was focused on the IOMMU side of things... And still no news about PCIe ACS (Access Control Services)
 
Reactions: Drazick

CrazyElf

Member
May 28, 2013
88
21
81
Just wondering: with the new AGESA 1.0.0.6, has anyone tried 2T versus 1T RAM? Were the extra-tight timings/RAM speed on 2T worth the penalty?
 
Reactions: Drazick

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Just wondering: with the new AGESA 1.0.0.6, has anyone tried 2T versus 1T RAM? Were the extra-tight timings/RAM speed on 2T worth the penalty?

What penalty are you talking about?
Based on my own experience there is no reason to use 2T on Ryzen, ever.
The memory controller basically outruns the actual memory signaling (design dependent), even on the very high-end boards.
 

CrazyElf

Member
May 28, 2013
88
21
81
What penalty are you talking about?
Based on my own experience there is no reason to use 2T on Ryzen, ever.
The memory controller basically outruns the actual memory signaling (design dependent), even on the very high-end boards.

I meant that when you run RAM at 2T, there is a performance penalty. Often, though, 2T means you can get faster clocks or tighter timings. The question is whether the extra clocks/tighter timings outweigh the 2T penalty.

But if it is looking like 2T is a liability rather than an asset, then it is 1T all the way.
 
Reactions: Drazick

mtcn77

Member
Feb 25, 2017
105
22
91
I meant that when you run RAM at 2T, there is a performance penalty. Often, though, 2T means you can get faster clocks or tighter timings. The question is whether the extra clocks/tighter timings outweigh the 2T penalty.

But if it is looking like 2T is a liability rather than an asset, then it is 1T all the way.
TechReport says 3866MHz+ DIMMs require 2T.
For example, the hot G.Skill DDR4-3866 memory I have here requires a 38.66 multiplier and a 2T command rate, but those settings simply couldn't be dialed in on Ryzen motherboards under the current AGESA (1.0.0.4) without resorting to changes in the base clock that also controls important bus rates like PCI Express.
http://techreport.com/news/31974/ryzen-agesa-1-0-0-6-exposes-more-memory-overclocking-options
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106

They refer to the specifications of the memory modules themselves. I'm quite certain that nobody has run Ryzen with the 3866MHz MEMCLK ratio at this point.

http://gskill.com/en/product/f4-3866c18d-16gtz
- Tested Speed 3866MHz
- 18-19-19-39-2N

All high speed G.Skill kits (other than FlareX for Ryzen) are 2N rated anyway.
 
Reactions: Drazick and w3rd

zir_blazer

Golden Member
Jun 6, 2013
1,184
459
136
AGESA 1.0.0.6 properly implements PCIe ACS. All good now for PCI passthrough without the need for patches.
Saw it already. Check here to see what it does, specifically my posts. I'm too lazy to redo the explanation from scratch.
Basically, at the moment, ACS is helpful on X370 motherboards that bifurcate a root port's 16 lanes to 8x/8x, so you can plug in two cards and each will be in an exclusive IOMMU group of its own. This gives Ryzen an edge over Skylake in the same situation.
There is more work to be done, since the chipset group is still an ugly mess. Other multifunction devices are also all landing in the same IOMMU group (this is default behavior for multifunction devices: it is assumed that other functions of the same device aren't isolated unless the vendor states otherwise and a quirk for them is included in the Linux kernel), which means that the Azalia, USB, and SATA controllers still require the ACS override patch. Now that ACS is done, the next step seems to be ATS (Address Translation Services).
 

mtcn77

Member
Feb 25, 2017
105
22
91
They refer to the specifications of the memory modules themselves. I'm quite certain that nobody has run Ryzen with the 3866MHz MEMCLK ratio at this point.

http://gskill.com/en/product/f4-3866c18d-16gtz
- Tested Speed 3866MHz
- 18-19-19-39-2N

All high speed G.Skill kits (other than FlareX for Ryzen) are 2N rated anyway.
You previously said the memory controller need not run 2T. I supposed that qualification was related to the controller, not the DIMMs.
 