Communication wouldn't go through the L3 cache itself, so its full latency is not part of the path - only the tag-search latency would apply. You don't pay the full access-and-fetch penalty on a cache miss. On Ryzen, the 1/4-tags system means that quite often you could have a result within a couple of cycles of submitting an address search, even in the event of a miss (some 65% of the time, if my mental statistical analysis is up to snuff - no promises).
I hope I'm not being dense, but why? The core typically does not know where the required data is at the moment. So it asks the L1 data cache; if that misses, the request goes on to L2, and on another miss it asks L3. Sure, tags are gr8, but they lower the typical access time from 17 ns to maybe 10 ns? (If the core knows the data is in L3 and asks for it directly.)
Now we should ask what the problems are for our core. Say it wants the result of a computation done by another thread. Threads of the same program share an address space and are aware of each other's locations, so it asks for the result. But it does not know where the required data physically resides, so it asks the core it knows is/was running that thread. That core must check its L3, then L2, and lastly L1 cache for the result (unless there is a mechanism/index that says this piece of data is not in cache, look elsewhere). If the data is not in any cache, it asks the memory controller for a fetch. Correct me if I'm wrong here.
A Ryzen quad-core module is actually two dual-core modules when you look at it closely. Even the L3 is mostly just a duplication, with some functional-block changes (PLL, and so on). The cores communicate via a mesh bus from L2 to L2 in the upper metal layers. So you have L1D latency + L2 latency + bus latency + L2 latency + L1D latency.
1.3 + 8.5 + 10 + 8.5 + 1.3 ~= 46 ns
To talk to another core in another CCX, contrary to my prior thinking, then you have the following latencies:
L1D + L2 + BUS + DFI + DF + IMC + DF + DFI + BUS + L2 + L1D
1.3 + 8.5 + 10 + ? + ? + ? + ? + ? + 10 + 8.5 + 1.3 ~= 140 ns
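Plugging in the known per-hop figures from above and lumping all the fabric-side unknowns (DFI, DF, IMC) into one term - a back-of-the-envelope sketch, not a measurement:

```python
# Cross-CCX latency model using the per-hop estimates from the post above (ns).
# The five Data-Fabric terms (DFI + DF + IMC + DF + DFI) are unknown, so we
# back out their combined cost from the ~140 ns total estimate.
known_hops = [
    1.3,   # L1D (requesting core)
    8.5,   # L2
    10.0,  # bus (outbound)
    10.0,  # bus (return)
    8.5,   # L2 (remote core)
    1.3,   # L1D (remote core)
]
known_total = sum(known_hops)          # ~39.6 ns of non-fabric hops
fabric_share = 140.0 - known_total     # ~100.4 ns implied for the fabric path
print(f"known hops: {known_total:.1f} ns, implied fabric cost: {fabric_share:.1f} ns")
```

If these hop estimates hold, the fabric round trip would dominate the cross-CCX cost by a wide margin.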
DFI, of course, is the data-fabric interface unit I assume must exist - you can't just send signals however you please; you need to find a window and mix with whatever other traffic might be on that bus.
From the hardware.fr test, L1 latency is 1.3 ns, L2 latency is 6.3 ns, and L3 latency is 17.3 ns.
So going by the closest route, pinging another core in the "dual" on the same CCX would be:
1.3 + 6.3 + bus + 6.3 + 1.3 = 15.2 ns + bus (I've never seen an L2 mesh bus in any Ryzen diagram - are you sure about its existence?)
Core-pinging results showed over 40 ns - are you suggesting that the bus between L2 caches takes ~25 ns?
Or maybe the pinging went through the L3 cache instead? It certainly seems closer: 2 × 17.3 = 34.6 ns. Accounting for the tests probably being done at a different CPU frequency (up to a 20% difference), that 34.6 ns could stretch to ~42 ns, matching the core-pinging results.
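Putting numbers to both alternatives above, using the hardware.fr figures (the 20% frequency scaling is a guess, not a measured ratio):

```python
# Two candidate round-trip paths within one CCX, hardware.fr latencies in ns.
L1, L2, L3 = 1.3, 6.3, 17.3

# (a) Hypothetical L2-to-L2 mesh path: every hop except the bus itself.
l2_path_known = L1 + L2 + L2 + L1        # 15.2 ns
implied_bus = 40.0 - l2_path_known       # ~24.8 ns if the ping really is ~40 ns

# (b) Through L3 instead: two L3 accesses, scaled for a possible ~20%
# clock-frequency difference between the two tests.
l3_path = 2 * L3                         # 34.6 ns
l3_path_scaled = l3_path * 1.2           # ~41.5 ns, close to the measured ~42 ns
print(f"implied bus: {implied_bus:.1f} ns, scaled L3 path: {l3_path_scaled:.1f} ns")
```

On these numbers, the L3 route needs no mystery 25 ns bus to explain the measurement, which is the point being made.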
BTW, 1.3 + 8.5 + 10 + 8.5 + 1.3 = 29.6 ns, not 46 ns.
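A quick sanity check of that sum:

```python
# L1D + L2 + bus + L2 + L1D, per-hop estimates from the earlier post (ns).
hops = [1.3, 8.5, 10.0, 8.5, 1.3]
total = sum(hops)
print(f"{total:.1f} ns")  # 29.6, not 46
```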
http://images.anandtech.com/doci/10591/HC28.AMD.Mike Clark.final-page-013.jpg
This slide suggests there is no special bus linking L2 straight to the DF; rather, requests go through L3 to get there. Please share the source of your info if otherwise.
Why are you so bent on all CCX-to-CCX queries going through the memory controller? This slide again suggests otherwise:
http://images.anandtech.com/doci/11170/AMD_Ryzen_Mark_Papermaster_Final-page-009_575px.jpg
It shows the cores' hubs connected to the Infinity Fabric, not to the memory controllers - those are separate entities connected to the IF.