That was awesome, and probably (definitely) over my head.
Glad you liked it. Bear in mind that there's a lot more for me to learn on the subject. One of these days I'll mess with aparapi and see what I can learn that way. All this waiting for Project Sumatra to finally show up in Java 9 is like . . . waiting for desktop Carrizo.
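For the curious, the Aparapi hello-world is about this simple. A minimal sketch, assuming the older com.amd.aparapi package; Aparapi translates the run() method's bytecode to OpenCL and falls back to a Java thread pool if no GPU is available:

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class VectorAdd {
    public static void main(String[] args) {
        final int n = 1_000_000;
        final float[] a = new float[n];
        final float[] b = new float[n];
        final float[] sum = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        // Aparapi converts this run() method to an OpenCL kernel and
        // dispatches one work-item per array element on the GPU.
        Kernel kernel = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                sum[i] = a[i] + b[i];
            }
        };
        kernel.execute(Range.create(n));
        kernel.dispose();
    }
}
```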
But!
So dGPU via PCI-e is latency-bound; in the past, that kind of bottleneck has just produced a new interface to alleviate the latency. VLB, PCI, AGP, PCI-X, PCI-e, whatever's next.
Compute-unit daughter cards? PCI-S(uper)? One of those seems more likely (cheaper) than trying to squeeze a foot-long GPU onto the CPU's real estate. Or is the entirety of today's dGPU not needed?
VESA Local Bus, ah, the memories. But I digress.
I was poking around on Wikipedia's PCI-e article today, and I found this line to be . . . interesting:
"PCIe sends all control messages, including interrupts, over the same links used for data. The serial protocol can never be blocked, so latency is still comparable to conventional PCI, which has dedicated interrupt lines."
Over the years, changes in expansion buses have mostly brought us higher bandwidth and more ability to run multiple devices without conflicts (anyone who was an early adopter of PCI sound cards may know what I'm talking about here: Ensoniq AudioPCI + 3dfx Voodoo Rush = fail). What they have not brought us is significantly lower latency, at least not in the jump from PCI to PCI-e (I'll qualify that by saying I'm uncertain whether PCI and PCI-e latencies are similar as a function of bus clocks or as a direct comparison in nanoseconds). And, for the most part, this makes sense: most expansion cards don't need latencies that low. Graphics cards get away with it by rendering the scene on-card and pushing frames out to the monitor without having to send much of anything back to the CPU for processing. With GPGPU, you can't do that. All the results have to go back to the CPU.
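If you want to see that round-trip cost for yourself, here's a rough sketch, again assuming Aparapi: it times a pile of tiny dispatches, and since each execute() is a full CPU-to-GPU-and-back trip, the average is dominated by bus/launch latency rather than actual math.

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class RoundTripCost {
    public static void main(String[] args) {
        final float[] data = new float[1024];
        Kernel kernel = new Kernel() {
            @Override public void run() {
                int i = getGlobalId();
                data[i] = data[i] * 2f;   // trivially small work, on purpose
            }
        };
        Range range = Range.create(data.length);
        kernel.execute(range);            // warm-up: the OpenCL compile happens here

        final int iters = 1_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            kernel.execute(range);        // each call: CPU -> bus -> GPU -> bus -> CPU
        }
        long t1 = System.nanoTime();
        System.out.printf("avg dispatch round trip: %.1f us%n",
                (t1 - t0) / (iters * 1000.0));
        kernel.dispose();
    }
}
```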
Buuuuut if you look at certain niche networking applications, such as super-duper high-speed cluster interconnects, you will see some expansion slots that HAVE produced lower latencies over the years. Example: AMD's own HTX slot. HTX is a workstation/server standard that basically gives an expansion card a direct link into the HyperTransport fabric. It is undoubtedly expensive to implement, since the HT topology requires direct links between pretty much every other HT device local to the board. Or . . . something like that.
So if there's any expansion slot in AMD's stable today that's really suited to dGPU compute in latency-sensitive situations, it's HTX. Intel's QPI spec allows for QPI expansion devices, if I recall correctly, though I have not seen them market anything like that or offer QPI slots on their platforms.
NostaSeronX (where has that guy been lately? It's like he vanished after several of the AMD roadmap announcements/leaks . . .) was going on somewhere about a hypothetical "PCI-e over HT" hybrid slot, also dubbed HTX, which might find its way onto certain AMD systems in the near-ish future. The idea seemed to be that a small, short-headered HT packet could nest itself inside a larger PCI-e packet, basically letting the system extend an HT link to a device over the PCI-e bus with HT-like latency. That would be hella cool if AMD could make it work, especially since the cost to implement would (presumably) be low enough for it to reach consumer motherboards. It would also make for an acceptably low-latency expansion slot for dGPUs, making them more versatile in latency-sensitive situations. That being said, it was NostaSeronX, and . . . yeah.
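Just to make the nesting idea concrete, here's a toy sketch of what that encapsulation might look like. To be clear, every field name and size below is invented for illustration; neither HT nor PCI-e framing actually looks like this on the wire:

```java
import java.nio.ByteBuffer;

public class HtOverPcieSketch {
    // A short-headered HT packet: a couple of pretend header bytes plus payload.
    static byte[] htPacket(byte command, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(2 + payload.length);
        buf.put(command);                  // pretend 1-byte HT command field
        buf.put((byte) payload.length);    // pretend 1-byte length field
        buf.put(payload);
        return buf.array();
    }

    // Nest the whole HT packet as the data payload of a larger PCI-e TLP.
    // The far end would strip the TLP framing and forward the inner HT
    // packet on, which is how the HT link gets extended over the PCI-e bus.
    static byte[] wrapInTlp(byte[] htPacket) {
        ByteBuffer buf = ByteBuffer.allocate(4 + htPacket.length);
        buf.putInt(htPacket.length);       // pretend 4-byte TLP header
        buf.put(htPacket);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] ht = htPacket((byte) 0x1, new byte[]{42, 43});
        byte[] tlp = wrapInTlp(ht);
        System.out.println("HT bytes: " + ht.length + ", tunneled TLP bytes: " + tlp.length);
    }
}
```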
All that aside, I am going to agree with greatnoob that the future of GPGPU is probably going to involve a lot of cooperation between iGPUs and dGPUs as they work asynchronously to knock out operations much more quickly than you could with traditional x86 cores. Until low-latency expansion slots become a reality for consumer-level machines, the low-hanging fruit (massive, highly-parallel workloads) will go to whatever kind of GPU is available while the oddball stuff (smaller, intermittent parallel workloads handled today via SIMD) will make iGPUs shine if the coders bother with it.
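Something like this hypothetical dispatch heuristic is what I'd expect that cooperation to boil down to. All the names and thresholds here are made up for illustration; a real runtime would calibrate them against measured launch and transfer latencies:

```java
public class GpgpuDispatchSketch {
    enum Target { CPU_SIMD, IGPU, DGPU }

    // Route a parallel job to whichever compute resource its size favors.
    static Target pickTarget(long workItems, boolean dGpuPresent) {
        if (workItems < 10_000) {
            return Target.CPU_SIMD;   // dispatch overhead would dominate
        }
        if (workItems < 5_000_000 || !dGpuPresent) {
            return Target.IGPU;       // shared memory, no bus round trip
        }
        return Target.DGPU;           // big enough to amortize the PCI-e hop
    }

    public static void main(String[] args) {
        System.out.println(pickTarget(1_000, true));       // CPU_SIMD
        System.out.println(pickTarget(100_000, true));     // IGPU
        System.out.println(pickTarget(50_000_000, true));  // DGPU
    }
}
```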
DrMrLordX,
As it stands now, I am not sure I believe even the (rumored) Bristol Ridge APU will be enough. (Re: only four "weak" AMD big cores in late 2016? And it still needs dual-channel memory, albeit DDR4, to keep the 512 SPs fed.)
Two things here (some of which I've said elsewhere . . .):
1). I think we all know that Bristol Ridge is a stopgap until AMD can replace Excavator with Zen in their APUs (and move to 14nm on all their CPUs). Whether or not this is solely a WSA issue is an interesting question that may never receive a satisfactory answer.
2). Don't be so quick to dismiss AMD's 28nm planar Construction cores as "weak". We haven't seen Excavator at work yet, and it will be a while before we see it in desktop form; expect Carrizo to have the usual mobile-processor compromises. Regardless . . . Steamroller alone has some real performance surprises, as I have learned in recent weeks running code on the thing (and I'm not talking about aparapi, C++ AMP, or anything like that).
As an example, take a look at CRFX's 500m y-cruncher score with a 5 GHz FX-8350 (161.398s). Now compare that to my 500m run with a 4.7 GHz A10-7700K (212.737s). I got a little speed boost running Linux, interestingly enough, but still: I was only about 50s off from a Piledriver with twice the modules and a 300 MHz clockspeed advantage. Double up my modules (which is technically impossible on FM2+) so I have 4M, and my time would probably land somewhere in the 110-120s range (perfect scaling would put it around 106s, so that leaves a little room for overhead). Still not world-beating, but it would raise a few eyebrows.