coercitiv
Diamond Member
- Jan 24, 2014
- 7,073
- 16,283
- 136
What stability issues?

It looks right about on par, possibly just stability and power state issues at present.

I have a source inside the PC testing circles who had access to Ryzen from the first samples through the QS samples, and he posted two interesting things about Ryzen: basically, you are actually BETA testing.
I think you misread/misheard.
The only update for Linux, which happened right away, was to correctly assign SMT threads.
-march=znver1 currently relies on the btver1 scheduler model, and btver1 is for AMD's Bobcat.
People are finding that using the Haswell scheduler model improves performance by 5-10% on Linux, but they are working on a proper Zen scheduler model that should bring more like 10-20% improvements in some cases.
And this is exactly what Windows needs, its own scheduler model for Ryzen...
Since we know of two different games that treat Ryzen as a 16-core processor, it made me think: is there a way to force a process to treat the CPU as 8C/16T instead? If this is an issue with more games, could this be the cause of the less-than-optimal performance we are seeing with SMT enabled in some games?

Win10's scheduler already does that. Besides, didn't you see that explicitly setting the R7 as 8-core in F1's case led to a measly 3% improvement?

But if the scheduler already does that, how come there is any improvement at all?

Because SMT, by virtue of simply being enabled, gimps a few queues (the uop queue, retire queue, and store queue) with static partitioning. If you can harness the full throughput with SMT (and Zen apparently does have more non-AVX throughput than Skylake), it is great. Not so much otherwise.
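The "treat it as 8 cores" experiment above can be tried by hand with CPU affinity. A minimal Linux-only sketch, assuming logical CPUs 0-7 are the physical cores and 8-15 their SMT siblings; that mapping is an assumption and should be checked against /sys/devices/system/cpu/cpu*/topology/thread_siblings_list before relying on it:

```python
import os

def pin_to_one_thread_per_core(pid=0):
    # Assumption: the first half of the logical CPU IDs are the physical
    # cores; verify against sysfs topology before using this for real.
    ncpus = os.cpu_count() or 1
    physical = set(range(max(1, ncpus // 2)))
    os.sched_setaffinity(pid, physical)  # pid 0 = the calling process
    return os.sched_getaffinity(pid)
```

Pinning a game's process like this would emulate "8C/8T" without touching SMT in the BIOS, which is exactly the comparison the question asks for.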
One thing that does scale with frequency in the real world, unexpectedly, is memory performance. Usually changing the core frequency doesn't have much of an impact on memory reads and writes - maybe 500MB/s or so. I'm seeing 35GB/s change to 43GB/s going from 3GHz to 3.8GHz, and Geekbench memory scores jump from 3,500 at 3GHz to 4,000 at 3.8GHz.
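For what it's worth, the reported jump tracks the core clock change almost one-to-one, which is what makes it surprising for a memory benchmark:

```python
# Quick check of the numbers above: bandwidth scales nearly 1:1 with clock.
clock_ratio = 3.8 / 3.0        # ~ +26.7% core clock
bandwidth_ratio = 43 / 35      # ~ +22.9% measured read bandwidth
geekbench_ratio = 4000 / 3500  # ~ +14.3% Geekbench memory score
print(f"clock x{clock_ratio:.3f}, bandwidth x{bandwidth_ratio:.3f}, "
      f"geekbench x{geekbench_ratio:.3f}")
```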
RAM acts as a last-level cache, so the problems may begin at the level of the last two points, right? The more requests miss and fall back to the L3 or RAM level, the more cycles are lost waiting for data, because the fabric bandwidth is shared between all the requests.

"RAM acts as a last-level cache" is an oxymoron and makes no sense. Please stop repeating it from wherever you heard it.
My understanding of the cache-lookup procedure on Ryzen is as follows:
- First cache lookup is to L1-D of same core. L2 isn't touched unless this fails. At this point TLB lookup (for virtual memory) has already succeeded, one way or another.
- Next the local L2 cache is checked. This takes a bit longer, as there are more ways to go through and it's further away.
- If the local L2 cache misses, the request is sent straight to the L3 cache. This holds, among other things, a partial copy of the L2 tags for other cores in the same CCX. If one of *those* hits, the request is then forwarded to that L2 cache in a partially-decoded state; the correct cache completes the lookup and supplies the data. Because the L2-L1 hierarchy is inclusive, it is not necessary to also perform a lookup in other L1 caches.
- If the L2 tag lookups fail, the L3 lookup was already in progress and now completes. If it hits, the data is supplied by the local L3 cache and promoted to the L2 and L1 caches.
- If the local L3 cache lookup fails, the request is broadcast over Infinity Fabric to the other L3 cache(s) in the system (possibly plural because of Naples & Snowy Owl). The hybrid L2/L3 lookup procedure above is thus repeated.
- If all of these lookups fail, the request is routed to the appropriate RAM controller. This is the final resort, and is only initiated after *all* possible cache lookups have proved fruitless.
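That cascade can be sketched as a toy model; all names and data structures here are invented for illustration, and the real hardware overlaps several of these steps in time rather than serializing them:

```python
# Toy walk-through of the lookup order described above. Cores 0-3 form CCX0
# and cores 4-7 form CCX1; l1/l2 are per-core sets of cached addresses, l3
# is per-CCX. Purely illustrative, not a timing model.
def lookup(addr, core, l1, l2, l3):
    if addr in l1[core]:
        return "local L1"
    if addr in l2[core]:
        return "local L2"
    ccx = core // 4
    # The local L3 holds shadow copies of the sibling cores' L2 tags.
    for peer in range(ccx * 4, ccx * 4 + 4):
        if peer != core and addr in l2[peer]:
            return "sibling L2"
    if addr in l3[ccx]:
        return "local L3"
    # Local miss: broadcast over Infinity Fabric to the other CCX.
    remote = 1 - ccx
    for peer in range(remote * 4, remote * 4 + 4):
        if addr in l2[peer]:
            return "remote L2"
    if addr in l3[remote]:
        return "remote L3"
    return "DRAM"  # last resort: route to the memory controller
```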
https://www.techpowerup.com/231518/a...ng-am4-updates
This. The info is incomplete; support for these memory speeds won't be for AM4 but for another platform.
Not sure if that was in reference to my last post, but all those are AM4 and the same platform as the ones we have right now.
The reason it's rather challenging to add new memory divider support for existing CPUs is that boards are designed with specific trace lengths, and all signal routing is done for specific memory frequencies.
It's not all the same from 2,133 to 3,600MHz. So even though the dividers may be added via microcode, they may not work at all, simply because the board's track/trace layout cannot handle such frequencies.
Again, I'm not saying it's impossible, but these frequencies are for another platform built with these memory dividers in mind from the beginning.
Your understanding is most likely wrong, because hardware.fr's tests strongly imply that any access in a block larger than 8MB goes straight to DRAM.
http://www.hardware.fr/articles/956-23/retour-sous-systeme-memoire.html

They assumed this from observing the abnormal jump in latency when accesses grow larger than 8MB, right? If the latency goes sky-high, it means there is an access to RAM. The test is empirical, so is it safe to assume it's 100% correct?

Could the latency of cross-CCX access be the same as that of memory?

Well, PCPer's numbers suggest it could be. But that would be a terrific mess-up at the design stage, so I am optimistic and think it was a deliberate choice to skip a global L3 altogether.
I have a theory that the dual-CCX idea is basically a working development platform for refining software (and hardware) for future AMD MCM / Infinity Fabric systems. Further, the dual-CCX idea is a more applicable and simplified version of Bulldozer's CMP idea: lower initial complexity, since software already knows SMP and there are now only two distinct sections needing communication. I postulate CMP had the same final goal (software/platform development on a competitive system) but was much too drastic a change for software to capitalize upon.
Did you say strictly technical or tin foil hats welcome?
The latency of inter-CCX cache access is still less than going all the way to RAM, but is more complicated to measure.