Thank you for sticking with my commentary and providing detailed replies. It was very informative and gives me some pointers to look into. I do agree that ARM is going to play a much more prominent role in the future beyond its current ecosystem. On HSA, I had a rough picture, but I'm definitely technically out of my depth in terms of navigating a discussion. Thank you for clarifying and correcting things for me. I also seem to have glossed over the important thing you pointed out, which is what the board and board components would cost to achieve this.
Not a problem. HSA is still pretty obscure, really, and I only know about some of the issues surrounding it from tinkering with my Kaveri back in the day and trying to launch HSA kernels through a Java toolchain. It was some wonky stuff, let me tell you. Regardless, poking at the underlying tech exposed a lot of the issues surrounding SVM and (sort of) UVM, and made it clear why HSA could have been so important. When it worked, it was awesome. Sort of.
To loop back around, where do you think AMD is possibly going to go w/ this Rome launch (PCIe 4.0/CCIX)?
I can only offer speculation.
First they need to lay the groundwork for future tech: higher bandwidth, lower latency. The old dream AMD laid out back in the late X2/Opteron era - before they had even launched their first APU - was the Fusion concept. The Future is Fusion! Etc. And key to that strategy was HTX. The entire idea behind HTX was making as many of the computer's slots and sockets as possible equal partners in the HyperTransport hierarchy. So a 4P board with 4 HTX slots would effectively be the same as an 8P board. There were all kinds of fun ideas there, like . . . having consumer boards in 2P configurations, allowing you to plug a "naked" GPU into the second socket and letting it use main memory like a CPU! And stuff like that. Or you could have an HTX daughtercard to add another CPU, or what have you.
It never materialized except on some server boards. What is interesting about HTX as it was implemented back in the day is that the slots themselves were nothing but PCIe slots, albeit keyed in reverse (so HTX cards and PCIe cards were not physically interchangeable). What I never did figure out is how much extra board infrastructure was required to service those slots at full speed. HT-over-PCIe was a thing, and given the slot, it's tempting to think that HTX slots were exactly that: HT over PCIe. I'm too ignorant to know if that was the case. If it was, then the amount of board infrastructure (traces, layers, etc.) required to service the slot would be about the same as for a full-length PCIe slot. Which really wouldn't be bad, even for us consumers. All you're doing is bypassing the PCIe controller (wherever it may be) and forcing the connected device to handle its own traffic over HyperTransport. Any device that goes into such a slot would have to have enough processing power to handle all that, and of course enough HT links (on old AMD systems, it was 1 link for 1P, 2 links for 2P, or 3 links for 4-8P, if I recall correctly).
So looking to the future with Rome, PCIe 4.0, and CCIX, it may be HTX all over again. I think the idea would be to make PCIe slots IF-capable (Infinity Fabric), so if the right device is plugged into the slot, it can bypass the PCIe controller in the CPU(s) and operate as an IF device instead. Or the slots may have to be dedicated to IF functionality, I don't know. In theory this would require no more board support than standard PCIe 4.0. Put a dGPU in there equipped with a tiny ARM core that has enough cache (notably L3 or L4, or whatever it is Rome systems will use for cache coherence) and now you can treat your GPU as a full member of the IF.

Unfortunately, dGPUs still have their own memory banks, and they probably will for the foreseeable future due to memory performance concerns. You would not want a "naked" dGPU trying to use system memory, even if it would be cool to let the dGPU manipulate data in RAM without a lot of redundant copying. It is possible that each dGPU's VRAM bank could be treated as a separate NUMA node, and that might bypass the problem . . . but now you have different NUMA nodes whose RAM sports non-uniform latency and bandwidth. Sure, we have MP systems with multiple memory controllers trying to balance reads and writes to minimize link traffic and make sure data gets exactly where it needs to go, but they do so with memory controllers that have uniform access latency and read/write speeds. Throw some dGPUs into the mix and you have something quite different. They might have to tag link members based on which one is a CPU, dGPU, or "other" and handle traffic accordingly. Done correctly, AMD should be able to port all their existing SVM tools to CCIX systems, giving us the same basic benefit in the future that you get today running OpenCL 2.0 applications on a Raven Ridge system.
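Just to make the NUMA angle concrete: software already has an API for dealing with nodes whose memory behaves differently. Here's a minimal libnuma sketch (Linux, C++, link with -lnuma). Everything in it is standard libnuma, but the "VRAM as a node" framing in the comments is purely my speculation, not anything AMD has announced:

    #include <numa.h>     // libnuma; build with: g++ numa_sketch.cpp -lnuma
    #include <cstdio>
    #include <cstring>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "No NUMA support on this system\n");
            return 1;
        }
        int nodes = numa_num_configured_nodes();
        std::printf("%d NUMA node(s) configured\n", nodes);

        // Report each node's size. In a hypothetical CCIX setup, a dGPU's
        // VRAM bank might show up here as just another node, with very
        // different latency and bandwidth than the DRAM-backed nodes.
        for (int n = 0; n < nodes; ++n) {
            long free_bytes = 0;
            long total = numa_node_size(n, &free_bytes);
            std::printf("node %d: %ld MB total, %ld MB free\n",
                        n, total >> 20, free_bytes >> 20);
        }

        // Pin a 64 MB allocation to node 0, the way you'd want "hot" CPU
        // data kept in ordinary DRAM rather than landing on a slow node.
        size_t sz = 64ul << 20;
        void *buf = numa_alloc_onnode(sz, 0);
        if (buf) {
            std::memset(buf, 0, sz);  // touch pages so they get placed
            numa_free(buf, sz);
        }
        return 0;
    }

The policy headaches I'm describing are exactly the part this API doesn't solve: the allocator lets you place memory, but something still has to know which node is DRAM and which is a dGPU.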
Outside of possible NUMA madness, there's also the question of whether those IF-capable PCIe slots would really be as cheap to implement as I think. It's possible they might not be so damn cheap. And you are still talking about putting server-class hardware on every device that has to plug into the system. For Rome's server market, that's not a problem.
What would you compare this to w.r.t. existing tech? Is it like NVLink over PCIe 4.0?
I know less about NVLink. I'm interested in it (and profoundly disappointed that nVidia apparently hasn't chosen to open NVLink up to the PCIe consortium). My understanding is that NVLink is configured to facilitate UVM support under CUDA. The entire idea behind UVM is basically . . . treat the memory of all devices connected to each other as one memory pool, and let devices read and write to other devices. The bandwidth is high and the latency is low, so it's functional. nVidia does not control the underlying platform of any system where its dGPUs operate, nor does it control the underlying CPU tech. So they implement NVLink at the pleasure of whoever chooses to host it - usually IBM POWER systems, or custom Intel boards (maybe?). Interestingly enough, it competes with OpenCAPI, which is featured on OpenPOWER systems. Or if it isn't in use today, it may see use in the future. I haven't heard much about real-world implementations of CAPI devices yet.
For more on UVM, check this out as an example:
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
At first glance, it doesn't really seem that different from SVM, but it really is, because in the UVM model there is no need to a) ensure that all compute devices have equal access to the same memory controller, or b) ensure that all compute devices are working together in a coherent NUMA environment. One of the critical lines from the above blog post is:
When code running on a CPU or GPU accesses data allocated this way (often called CUDA managed data), the CUDA system software and/or the hardware takes care of migrating memory pages to the memory of the accessing processor.
Which is pretty clever, if you think about it. NV needs tech like that since they can't control any part of the underlying platform on which their cards run. So they handle it in software with the aid of their GPUs. I still expect AMD's CCIX implementations to rely on hardware-level solutions, like NUMA nodes with atypical memory performance or something weird like that. From a performance perspective, doing it in hardware should deliver superior results.
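For reference, the pattern that blog post describes boils down to something like this (CUDA C++, built with nvcc; it's essentially the blog's own toy example, nothing AMD-specific):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Trivial kernel: y[i] += x[i]. Grid-stride loop so any launch
    // configuration covers the whole array.
    __global__ void add(int n, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (; i < n; i += blockDim.x * gridDim.x)
            y[i] += x[i];
    }

    int main() {
        const int N = 1 << 20;
        float *x, *y;

        // One allocation call, one pointer, visible to both CPU and GPU.
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));

        // CPU writes: the pages live in (or migrate to) host memory here.
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // GPU reads/writes: the driver/hardware migrates pages to the GPU
        // on demand. Note there is no cudaMemcpy anywhere.
        add<<<256, 256>>>(N, x, y);
        cudaDeviceSynchronize();

        // CPU touches the data again: pages migrate back on fault.
        std::printf("y[0] = %f (expect 3.0)\n", y[0]);

        cudaFree(x);
        cudaFree(y);
        return 0;
    }

Same pointer on both sides, and the page migration happens behind your back - which is exactly the "handle it in software with the aid of their GPUs" part.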
Do you think they will expose this on AM4?
It's all a matter of cost/benefit. As I stated, AMD hasn't even rolled out the amdkfd on Windows yet. No kernel fusion driver means no SVM functionality, which is probably going to be core to everything AMD tries to do. Next you have to get motherboard OEMs to support PCIe 4.0. That may be cost-prohibitive at some price levels, for a while at least. After that, AMD has to provide usable libraries so developers can start using all this tech seamlessly as part of their workflow. I *think* they've already started down that path with Mantle/Vulkan and maybe DX12 for game devs. Outside of that, everything they have is pretty janky. OpenMP to the rescue? I dunno. Ultimately it will be developers and software that drive the whole thing. It takes at least one shop deciding to offload a bunch of stuff to a GPU to get the ball rolling. LibreOffice can already use SVM to offload work to the iGPU on Kaveri, Carrizo, and Raven Ridge, but that's just one tiny example.
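For anyone who hasn't played with it, here's roughly what SVM looks like from the host side in OpenCL 2.0. This is a minimal sketch of the standard API (error checking and object releases mostly omitted), not AMD's tooling; fine-grain buffer support is what the HSA-class APUs like Kaveri, Carrizo, and Raven Ridge advertise:

    #include <CL/cl.h>
    #include <cstdio>

    // Kernel: doubles each element in place, through an SVM pointer.
    static const char *src =
        "__kernel void scale(__global float *d) {"
        "    size_t i = get_global_id(0);"
        "    d[i] *= 2.0f;"
        "}";

    int main() {
        cl_platform_id plat; clGetPlatformIDs(1, &plat, nullptr);
        cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, nullptr, nullptr);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
        clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "scale", nullptr);

        // Fine-grain SVM: one pointer usable by both CPU and GPU, with no
        // explicit copies and no map/unmap. Requires the device to report
        // CL_DEVICE_SVM_FINE_GRAIN_BUFFER.
        const size_t n = 4096;
        float *data = (float *)clSVMAlloc(
            ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
            n * sizeof(float), 0);

        for (size_t i = 0; i < n; ++i) data[i] = (float)i;  // CPU writes directly

        clSetKernelArgSVMPointer(k, 0, data);  // raw pointer, no cl_mem wrapper
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clFinish(q);

        std::printf("data[2] = %f (expect 4.0)\n", data[2]);  // CPU reads directly

        clSVMFree(ctx, data);
        return 0;
    }

No kernel fusion driver on the platform means clSVMAlloc has nothing to back it - which is why the amdkfd situation on Windows matters so much.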
Threadripper? Where could this go down the line?
Threadripper is a question mark. Again, though, it'll probably come down to the software. If AMD rolls out all the tools to clean up their OpenCL 2.0/SVM compute model AND provides the hardware to support it with their dGPUs, someone will figure out how to use it to their advantage. Think small, tight loops spawned across dozens of different threads, full of fp calculations being sent asynchronously to the dGPU and returned out of order with extremely low latency. That is something you can't really do with OpenCL 1.x or even CUDA (I don't think). With OpenCL 2.0 it might be possible.
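To be concrete about the mechanics: an out-of-order queue plus events is how you'd express that kind of dispatch in OpenCL. A sketch - the function name and chunking scheme are made up for illustration, and it assumes a context/device/kernel set up as in the SVM example above (the low latency would come from SVM removing the copies, not from the queue itself):

    #include <CL/cl.h>

    // Sketch: fire off many small kernels on an out-of-order queue and let
    // events, not queue order, express completion. Hypothetical helper.
    cl_int dispatch_async(cl_context ctx, cl_device_id dev,
                          cl_kernel kernel, size_t chunk, int chunks) {
        if (chunks > 64) chunks = 64;  // keep the event array simple

        cl_queue_properties props[] = {
            CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
        };
        cl_int err;
        cl_command_queue q =
            clCreateCommandQueueWithProperties(ctx, dev, props, &err);
        if (err != CL_SUCCESS) return err;

        // Enqueue every chunk without waiting; the runtime is free to
        // overlap and reorder them. Results come back via the events.
        cl_event events[64];
        for (int i = 0; i < chunks; ++i)
            clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &chunk, nullptr,
                                   0, nullptr, &events[i]);

        clWaitForEvents((cl_uint)chunks, events);
        for (int i = 0; i < chunks; ++i) clReleaseEvent(events[i]);
        clReleaseCommandQueue(q);
        return CL_SUCCESS;
    }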
As a caveat, though, I will point out that AMD has invested in making Zen2 wider on the SIMD front. And the example I give above is one where you might want to use the CPU's SIMD capabilities instead of trying to offload to a dGPU. So AMD is at least hedging their bets.
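For contrast, here's what keeping a loop like that on the CPU looks like with AVX2 intrinsics - the 256-bit datapath Zen2 is supposed to run at full rate. The function is a made-up example (a basic saxpy), just to show the shape of the alternative:

    #include <immintrin.h>  // AVX2/FMA; compile with -mavx2 -mfma
    #include <cstddef>

    // Tight fp loop: y[i] = a * x[i] + y[i]. For short arrays this kind of
    // thing beats paying dispatch latency to any dGPU.
    void saxpy_avx2(std::size_t n, float a, const float *x, float *y) {
        const __m256 va = _mm256_set1_ps(a);
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            vy = _mm256_fmadd_ps(va, vx, vy);  // a*x + y in one instruction
            _mm256_storeu_ps(y + i, vy);
        }
        for (; i < n; ++i)                     // scalar tail
            y[i] = a * x[i] + y[i];
    }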
Have any thoughts on NVMe 1.4 and beyond, where that's headed, and how it might integrate with AMD's roadmap?
Honestly, not really. I'm still hoping people stop using the M.2 form factor so damn much. But I probably won't get my way.
Any new sys memory paradigms on the horizon? When can we expect something beyond DDR4 sys mem?
HBM for everybody? DRAM's days are numbered thanks to the general inability of anyone to keep scaling it much below the 10nm class. So the switch will eventually need to be made to a memory standard that can work at 7nm, 5nm, or even 3nm.
Nobody knows, but my guess is nope. AMD has ALWAYS had problems with latency compared to Intel, and that hurts gaming performance significantly. It should be better than Zen+, but I think halo gamers will be disappointed with the gaming performance of Zen2. The 9900K will retain the gaming crown, but Zen2 should narrow the gap some.
You mean since they introduced Zen, right? As @tamz_msc articulated, inter-CCX latency is one of the things that's punishing Zen and Zen+ in some games. Speed up the IF link and move all 8 cores onto the same chiplet and things get interesting.