kokhua
Sep 27, 2018
You can't just add more cores to a socket willy-nilly anyway; you need both the pins to deliver power and a motherboard design that can already supply that power. I don't need to draw a diagram: if AMD wanted to increase packaging complexity and deploy a new socket, they could simply package more CPU dies using the existing ratios. New DDR and PCIe standards will keep providing more bandwidth per pin, allowing each die to continue increasing core counts.
AMD has publicly said that ROME will be socket-compatible with NAPLES. It is reasonable to assume that AMD planned ahead when they defined the SP3 socket. In any case, the power and pin-count issue is the same whether you are talking about 4x16C or 8x8C+1.
Currently, the home agent for tracking memory requests is the memory controller that the data resides on. This means any data that needs to transfer between CCXs has to be checked with the memory controller where that data ultimately resides: if CCX A is on die A, CCX B is on die B, and the memory resides on die C, you can see the problem. Distributed home agents generally means that the home agent closest to the data is responsible for tracking that data.
This was the big change Intel made when moving to Skylake SP/EP/whatever; go look at the complex memory workload benchmark data to see the result. Except Intel has a home agent per core, which is actually a massive amount of logic; AMD could do one per CCX.
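To make the distinction concrete, here's a toy sketch of the two schemes. This is purely my own illustration, loosely modeled on Intel's move to per-tile caching/home agents; the die names, interleave functions and CCX counts are made up:

```python
# Toy model of where the coherence "home" for an address lives.
# All numbers and names below are invented for illustration only.

DIES = ["die_A", "die_B", "die_C", "die_D"]   # hypothetical CPU dies
CCX_PER_DIE = 2                               # assume 2 CCXs per die

def centralized_home(addr):
    """Today: the home agent is the memory controller that owns the address.
    Every cross-CCX transfer must consult that controller's die, even when
    neither the requester nor the responder lives there."""
    owning_die = DIES[(addr >> 12) % len(DIES)]   # coarse page interleave
    return ("memory_controller", owning_die)

def distributed_home(addr):
    """Distributed: home-agent duties are spread across CCX-local agents
    (one per CCX, as suggested above), hashed per cache line, so tracking
    state no longer funnels through a single memory-controller die."""
    ccx = (addr >> 6) % (len(DIES) * CCX_PER_DIE)
    return ("ccx_home_agent", DIES[ccx // CCX_PER_DIE], ccx % CCX_PER_DIE)

addr = 0x1234_5678
print(centralized_home(addr))
print(distributed_home(addr))
```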
Thanks for the explanation. My hypothetical architecture is UMA, so Home Agents are not even needed, correct?
It all comes down to cost. Remember, you have to build an entire SKU stack; what do your costs look like on a salvaged 4- or 6-core part?
Of course. Here, my hypothetical architecture is even more flexible. You can configure various SKUs in 2 ways:
(a) Use fewer CPU dies. For example, 64C would be 8+1 and a 32C SKU would be 4+1.
(b) Use salvaged dies. For example, 64C would be 8+1 fully good dies and 32C would still be 8+1, but each of the 8 CPU dies would have 4 non-functioning or fused-off cores.
Given the tiny die size of the CPU, I suspect yields will be very high and (a) will be the preferred option. Either way, I don't see a cost disadvantage vs 4x16C. As an aside, note that option (a) applied to Threadripper is even more advantageous. Now you can configure a 16C TR3 with just 2+1 dies and you still get 4-channel memory and the full 128 PCIe lanes. You can even get the full 8-channel memory if AMD decides to support it in the future. And there are none of the NUMA problems associated with TR2.
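If it helps, here is a rough enumeration of how the two options could play out. The 8-cores-per-die and 8+1 figures are just the assumptions from my diagram; nothing here is confirmed by AMD:

```python
# Hypothetical SKU construction under options (a) and (b).
# CORES_PER_DIE and the 8+1 layout are assumptions from my diagram.

CORES_PER_DIE = 8

def sku_option_a(target_cores):
    """Option (a): populate only as many fully-enabled CPU dies as needed."""
    return {"cpu_dies": target_cores // CORES_PER_DIE,
            "sc_dies": 1,
            "cores_per_die": CORES_PER_DIE}

def sku_option_b(target_cores, cpu_dies=8):
    """Option (b): always populate 8 CPU dies + 1 SC and fuse off cores."""
    return {"cpu_dies": cpu_dies,
            "sc_dies": 1,
            "cores_per_die": target_cores // cpu_dies}

print(sku_option_a(64))   # 8 CPU dies + 1 SC, 8 cores each
print(sku_option_a(32))   # 4 CPU dies + 1 SC, 8 cores each
print(sku_option_b(32))   # 8 CPU dies + 1 SC, 4 cores each (salvage)
```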
I've read it, and it doesn't change my point: you're making things cost more power and more latency, since all memory requests get an extra hop in each direction. If you can make it go faster (architectural design around memory access), you can do the same thing without the uncore die and be even faster again.
Worst-case performance is always a cache miss/dirty line and a re-read from memory; you've just increased access latency.
I understand and acknowledge that the biggest disadvantage of this architecture is the increased round-trip latency arising from the "dis-integration" of the CCX from the memory controller. The additional latency comes from the CPU-to-System Controller link and the router ("Cache Coherent Network" in the diagram). But there are things you can do to mitigate this:
(a) As I explained earlier, the latency of the CPU-SC link can be made very low, for example by using a fast and wide parallel link, avoiding the latency of a SERDES altogether. Effectively, these are just "wires". Since these "wires" are very short (~2-3mm), direct and located at the die edges, we can use very low power drivers together with a low voltage swing to reduce power consumption. Of course, a parallel interface means a lot of signals. How many? Assuming each core needs 2.5GB/s of bandwidth, the CPU-SC link would have to be capable of at least 20GB/s. If each wire runs at 4Gbps single-ended (~DDR5 rates), then we need 40 signals at a minimum (I've sketched out this arithmetic at the end of this post). Let's say we need 60; that would be something like 3 rows of 20 C4 bumps along the edge of the dies. I don't think it is impossible. We can probably use ordinary MCM packaging, without resorting to silicon interposers. Of course, if AMD has something equivalent to EMIB, that's even better. What is the added latency in this case? Maybe a couple of ns, I don't know for sure. But certainly not a showstopper.
(b) Move to an 8-core CCX and double the shared L3 cache to 32MB. This should reduce the memory traffic meaningfully. I estimated the CPU die size will increase by roughly 35%, from 48mm^2 to 64mm^2, which is about what I depicted in the diagram.
(c) As for the "router", I am not sure how much latency that adds. I'm out of my depth here. But this is not something new. Ampere's just-released ARM server processor sports a very similar architecture, except it's monolithic (https://amperecomputing.com/wp-content/uploads/2018/02/ampere-product-brief.pdf). AMD certainly does not lack expertise in this.
(d) Finally, grouping all the 8 memory channels together gives you a lot of flexibility in optimizing memory controller architecture. I'm sure there are plenty of tricks you can use to hide/overlap DRAM latency. And the very high bandwidth will certainly improve utilization efficiency given the bursty nature of memory access patterns.
To summarize, I don't think the performance will be any worse than the 4x16C configuration. I suspect (speculate) that moving away from NUMA alone will provide a nice uplift.
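For anyone who wants to check my numbers, here is the back-of-envelope arithmetic behind (a), written out as a tiny script. The inputs are the same assumptions stated above; the ~90ns DRAM miss figure at the end is just an illustrative round number, not a measurement.

```python
# Back-of-envelope numbers for point (a).  All inputs are assumptions
# stated in the post above, nothing measured.

CORES_PER_DIE    = 8      # 8-core CCX per CPU die
BW_PER_CORE_GBPS = 2.5    # assumed bandwidth need per core, GB/s
WIRE_RATE_GBPS   = 4.0    # per-wire rate, single-ended, ~DDR5-class (Gbps)

link_bw   = CORES_PER_DIE * BW_PER_CORE_GBPS     # 20 GB/s per CPU-SC link
wire_bw   = WIRE_RATE_GBPS / 8                   # 0.5 GB/s per wire
min_wires = link_bw / wire_bw                    # 40 data wires minimum

print(f"link bandwidth : {link_bw:.0f} GB/s")
print(f"minimum wires  : {min_wires:.0f} (call it ~60 with clocking/overhead)")

# Latency impact: a couple of ns each way on the CPU-SC link, against a
# DRAM miss that already costs on the order of ~90 ns (illustrative figure),
# is a few percent, not a doubling.
EXTRA_HOP_NS = 2.0
DRAM_MISS_NS = 90.0
print(f"added round-trip overhead on a miss: ~{2 * EXTRA_HOP_NS / DRAM_MISS_NS:.0%}")
```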
It's not wishful, it's just generally not worth it; otherwise we would be seeing dirt-cheap 45/32nm SOI eDRAM caches packaged with lots of CPUs, but we don't.
I don't know why we are not seeing many designs with L4 cache but I certainly can't conclude that it is not worth it. Could it be because you can't add a meaningfully large L4 cache on a monolithic die? I imagine you would need something on the order of 8MB/core to get meaningful benefits. I do know of one example system that uses L4 cache: IBM z14. I presume it is quite successful.
Yeah, I don't see it. AMD think they can address 80% of the market with NAPLES; they will be able to address even more with a 64-core, 256MB L3, improved inter-CCX latency SKU. How much more of the market does your design open up, and at what cost?
Below is what I wrote earlier about the advantages my architecture offers:
"..you can configure various SKUs for EPYC and TR by using more or fewer CPU dies as necessary. No artificial de-featuring or need for salvaging dies. None of the core/cache ratio, memory channels and I/O availability concerns. None of the duplicated blocks wasting die area like NAPLES needed. Very simple “wiring” at the package level. Also, none of the NUMA issues that caused NAPLES to underperform Xeon in some workloads. Supports 2P and 4P configurations with only 1 hop from any requester to any responder compared with 2P only for NAPLES and 2 hops.
I especially like the idea of a large L4 cache (or L3, if you remove that from the CCX and enlarge the L1/L2 instead). It would be very beneficial for many server workloads. Of course, that is just my wishful thinking. After estimating the die size, I concluded that it would not be possible to add an L4 cache of meaningful size. But who knows, if they decide to move the SC to 7nm as well in Milan...."
They relate mainly to cost and flexibility. I've done some very crude cost estimates of both the 4x16C and 8x8C+1 configurations; the manufacturing costs are very similar, around $200. I did not claim that my architecture is better than a scaled NAPLES architecture or will address a larger market. To me, both 8x8C+1 and 4x16C are fine and will put AMD in a good competitive position vs Intel.
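For reference, this is roughly the kind of crude estimate I mean: a minimal die-cost sketch assuming a simple Poisson yield model. The wafer price, defect density and die sizes are placeholder guesses on my part, so only the method matters, not the exact dollar figures.

```python
import math

# Crude per-package die-cost sketch.  Every constant below is an assumption.
WAFER_COST = 9000.0   # assumed 7nm wafer price, USD
WAFER_DIAM = 300.0    # wafer diameter, mm
DEFECT_D0  = 0.2      # assumed defect density, defects per cm^2

def dies_per_wafer(area_mm2):
    # Standard approximation: gross dies minus edge loss.
    return int(math.pi * (WAFER_DIAM / 2) ** 2 / area_mm2
               - math.pi * WAFER_DIAM / math.sqrt(2 * area_mm2))

def die_yield(area_mm2):
    # Simple Poisson yield model.
    return math.exp(-DEFECT_D0 * area_mm2 / 100.0)

def die_cost(area_mm2):
    return WAFER_COST / (dies_per_wafer(area_mm2) * die_yield(area_mm2))

# 8x8C+1: eight ~64 mm^2 CPU dies (the SC die and packaging come on top,
# presumably on a cheaper, older node).
print(f"8x ~64mm^2 CPU dies : ~${8 * die_cost(64):.0f}")

# 4x16C: four larger dies, say ~180 mm^2 each (again, a guess).
print(f"4x ~180mm^2 dies    : ~${4 * die_cost(180):.0f}")
```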
Let me be clear once again: my diagram simply reflects my belief that ROME will be 9 dies. If you have reason not to believe the 9-die rumor, then stick with 4x16C or whatever. If you also believe that ROME will be 9-die, or at least think it is possible, but my diagram is nonsense, then I'd like to learn why you think that.