Timur Born
Senior member
- Feb 14, 2016
- 300
- 154
- 116
Is it expected that this only leads to minor performance differences then (2x CCX being slightly faster)?
Is there any way to make a custom voltage curve and bind it to the multiplier, so I can have 1.15V for 3 GHz and 1.35V for 4 GHz?
Yes, via P-State settings. Currently seems to be bugged in the AMD CBS BIOS settings (if available at all), but ASUS ZenStates for Windows or ZenStates-Linux may work for you.
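For reference, here is a minimal C sketch of how the Zen P-state fields map to clock and voltage, which is what tools like ZenStates ultimately program. The FID/DID/VID encoding below is my understanding of it rather than anything stated in this thread, and the specific field values are only illustrative examples matching the 3 GHz / 1.15V and 4 GHz / 1.35V targets above.

/* Rough sketch: Zen P-state FID/DID/VID to frequency and voltage.
 * The encoding is an assumption (200 MHz * FID / DID, 1.55 V - 6.25 mV * VID),
 * not something confirmed in this thread. */
#include <stdio.h>

static double pstate_mhz(unsigned fid, unsigned did) {
    return 200.0 * fid / did;      /* core clock in MHz */
}

static double pstate_volts(unsigned vid) {
    return 1.55 - 0.00625 * vid;   /* core voltage in volts */
}

int main(void) {
    /* Illustrative values only: ~3.0 GHz at ~1.15 V and ~4.0 GHz at ~1.35 V. */
    printf("low P-state: %.0f MHz @ %.3f V\n", pstate_mhz(120, 8), pstate_volts(64));
    printf("P0:          %.0f MHz @ %.3f V\n", pstate_mhz(160, 8), pstate_volts(32));
    return 0;
}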
Curious observation: setting the minimum processor state to 50% or lower in the AMD Ryzen power profile results in a x19.38 minimum multiplier. Doing the same in the Power Saver profile results in the usual x22 minimum multiplier. Any ideas why?
I even saw x18.5 now and still wonder why this only happens with AMD's power profile.
Here is some very interesting L3 cache behavior of Ryzen CPUs being discussed; it was discovered by hardware.fr:
https://forum.beyond3d.com/threads/amd-ryzen-cpu-architecture-for-2017.56187/page-88
edit:
Forgot to add my text :
Am I wrong about this, or maybe it is obvious? The L3 cache is a victim cache only for the CCX it is directly connected to; the L3 of CCX A has no purpose for CCX B and vice versa.
I sense something similar happening with the old CPU-Z benchmark, where certain sequences of integer instructions were unexpectedly running much faster.

The behavior is exactly as expected for a bifurcated victim cache (or dual independent victim caches).
The Ryzen 5 1400 has 8MB of L3 in total, in two 4MB chunks. Allocating 4MB will not entirely end up in the L3 - some will be in the L2, some will be in the L3, and some will not be on-die at all but still out in main memory, as other program data, kernel overhead, and context switching will easily cause lines to be evicted from the L3.
For the 16MB L3 CPUs, any single chunk over 6.5~7.5MB will have the same problem. Cache-aware algorithms need to treat Ryzen as if it has 45% of the reported L3. I've made adjustments to some of my code to that effect and noticed 30%+ improvements in performance - so now Ryzen shows me a few cases where it is 40% faster than Sandy Bridge per clock with my own hand-rolled code (very specific cases running integer and floating point operations in close proximity on a set of chunked data).
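As a rough illustration of the "treat it as ~45% of the reported L3" rule, here is a minimal C sketch that sizes work chunks to the effective per-CCX L3. This is my own example, not the code referred to above; the 16MB figure and the kernel callback are placeholders.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Reported L3 of a two-CCX, 16MB part; adjust for the actual CPU. */
#define REPORTED_L3_BYTES ((size_t)16 * 1024 * 1024)

/* Treat only ~45% of the reported L3 as a safe working-set size,
 * per the observation that each CCX effectively sees only its own half. */
static size_t effective_l3_bytes(void) {
    return REPORTED_L3_BYTES * 45 / 100;   /* ~7.2MB for a 16MB part */
}

/* Walk a large buffer in chunks that fit the effective L3, so each chunk
 * stays cache-hot while the kernel works on it. */
static void process_chunked(uint8_t *data, size_t len,
                            void (*kernel)(uint8_t *chunk, size_t n)) {
    size_t chunk = effective_l3_bytes();
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? (len - off) : chunk;
        kernel(data + off, n);
    }
}

/* Trivial example kernel: just touch every byte of the chunk. */
static void sum_kernel(uint8_t *chunk, size_t n) {
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++) s += chunk[i];
    printf("chunk of %zu bytes, sum %lu\n", n, s);
}

int main(void) {
    static uint8_t data[(size_t)32 * 1024 * 1024];   /* 32MB of zeroed data */
    process_chunked(data, sizeof data, sum_kernel);
    return 0;
}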
Thanks for the reply. The other interesting part of the test is the 1500x and the 1600 behavior.
Both have 2x 8MB for a 4/8 and a 6/12 configuration, yet at 6MB and 8MB there is a latency increase, while the 1600X, which is also 6/12, shows none. Is this a measurement error, or are the internal cache configuration, or the way the cores can access it, different in the 1500X and 1600 compared to the 1600X?
Why is the 1600X performing as if it has the full 2x 8MB?
Only after 8MB does the latency jump to around 80ns, in line with system memory access times.
I would think that the difference between the 1600 and 1600X is explained by the faster average latency of the higher-clocked L1/L2/L3 caches in the 1600X.
Meanwhile, the 1500X has two fewer L1 and L2 caches, making it more dependent on the L3 cache (and its higher latency) for fetches that would have been an L2 hit on the six-core parts.
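For anyone who wants to reproduce the shape of that latency curve, here is a rough pointer-chasing sketch in C. It is my own approximation, not hardware.fr's harness; the buffer sizes, step count, and the 64-byte line-size assumption are mine.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64   /* assumed cache-line size in bytes */

/* Average latency (ns) of dependent loads over a buffer of the given size. */
static double chase(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(void *);
    size_t stride = LINE / sizeof(void *);
    size_t links = n / stride;
    void **buf = malloc(n * sizeof(void *));
    size_t *order = malloc(links * sizeof(size_t));

    /* Build a random cyclic chain, one node per cache line. */
    for (size_t i = 0; i < links; i++) order[i] = i;
    for (size_t i = links - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < links; i++)
        buf[order[i] * stride] = &buf[order[(i + 1) % links] * stride];

    void **p = &buf[order[0] * stride];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                              /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    if (p == NULL) puts("");   /* keep the chase from being optimized away */
    free(order);
    free(buf);
    return ns / (double)steps;
}

int main(void) {
    for (size_t mb = 1; mb <= 32; mb *= 2)
        printf("%2zu MB: %.1f ns/load\n", mb, chase(mb << 20, (size_t)20 << 20));
    return 0;
}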
People here were reporting broken 16-bit execution in a VM on Ryzen soon after it was released.

Another interesting CPU-side bug was found on Ryzen:
http://www.os2museum.com/wp/vme-broken-on-amd-ryzen/
VME stands for Virtual 8086 Mode Enhancements. It is not related to modern virtualization; instead, it relates to the Virtual 8086 mode introduced with the 386 for running 16-bit code inside a 32-bit OS. The crashes are most easily triggerable in a Windows XP VM, but they can happen natively in FreeDOS, too.
Just wondering, with the new AGESA 1.0.0.6, has anyone tried 2T versus 1T RAM? Were the extra-tight timings/RAM speed achievable at 2T worth the penalty?
Missed it completely. I was focused on the IOMMU side of things... And still no news about PCIe ACS (Access Control Services)
What penalty are you talking about?
Based on my own experience there is no reason to use 2T on Ryzen, ever.
The memory controller basically outruns the actual memory signaling (design dependent), even on the very high-end boards.
I meant that when you run RAM at 2T there is a performance penalty. Often, though, 2T means you can get faster clocks or tighter timings. The question is whether the extra clocks/tighter timings outweigh the 2T penalty.
But if it is looking like 2T is a liability rather than an asset, then it is 1T all the way.
TechReport says 3866MHz+ DIMMs require 2T:
http://techreport.com/news/31974/ryzen-agesa-1-0-0-6-exposes-more-memory-overclocking-options
"For example, the hot G.Skill DDR4-3866 memory I have here requires a 38.66 multiplier and a 2T command rate, but those settings simply couldn't be dialed in on Ryzen motherboards under the current AGESA (1.0.0.4) without resorting to changes in the base clock that also controls important bus rates like PCI Express."
Saw it already. Check here to see what it does, specifically my posts. I'm too lazy to redo the explanation from scratch.

AGESA 1.0.0.6 properly implements PCIe ACS. All good now for PCI passthrough without the need for patches.
You previously said the memory controller need not run 2T. I supposed the qualification was related to the controller, not the DIMMs.

They refer to the specifications of the memory modules themselves. I'm quite certain that nobody has run Ryzen with the 3866MHz MEMCLK ratio at this point.
http://gskill.com/en/product/f4-3866c18d-16gtz
- Tested Speed 3866MHz
- 18-19-19-39-2N
All high-speed G.Skill kits (other than FlareX for Ryzen) are rated at 2N anyway.