GPGPU computing and virtual machines

koshling

Member
Nov 15, 2005
43
0
0
GPGPU is obviously becoming more important, but currently it's pretty inaccessible in terms of programming model.

What prevents (or when do you think we'll see, if nothing fundamentally prevents) extensions of common virtual programming environments (Java and .NET really) in some constrained fashion into the GPGPU realm?

Obviously GPGPU is sensible/optimal only for certain classes of algorithm, and likely that would reflect into the virtual environment in several ways:

1) Very limited support for system libraries normally taken for granted in those environments (possibly right down to none at all, apart from GPGPU support infrastructure classes)

2) Probably some extensions, either to the languages or (more likely) via system support classes, to provide highly parallel execution models

3) New system support classes to allow (presumably asynchronous) communication between the GPGPU and CPU realms (maybe this looks just like distributed computing models, but I'm not sure, since the performance characteristics would be atypical)

With APUs heading for the mainstream from both AMD and Intel, and the discrete GPUs already out there, it seems that there could be significant benefits from leveraging GPGPU capabilities much more widely. However, to realistically do so, they need to be much more accessible than is the case today. What are your thoughts...?
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
To begin, let's distinguish virtual machines, i.e., system emulation platforms like VMware's product line, KVM, VirtualBox, etc., from managed runtimes, like Java and .NET, which flatter themselves with the term "virtual machine" but provide a much higher-level abstraction. There are actually significant GPGPU challenges in both areas, but since the OP is clearly talking about managed runtimes, my response is targeted toward that arena.

The first problem to address before we'll see any widespread adoption of GPGPU in managed languages is the discovery of meaningful GPGPU workloads. Right now, GPGPU is almost exclusively used for numerical applications (read: so-called 'scientific' codes, historically a zero-billion dollar industry). There is some hope for physics offload, though that will probably be done directly through a vendor API, e.g., DX, or perhaps more directly through an OEM driver.

In other words, GPUs are so hard to use (not only to program, but to find suitable algorithms) that there just aren't a lot of workloads -- regardless of language -- that are good fits for GPGPU paradigms.

There are few, if any, implementation impossibilities in merging GPGPU and managed languages. Very few languages cannot link to C -- the mechanics of 'hooking up' the GPU in virtually any language are solvable.
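As a trivial illustration of the 'link to C' point, here's roughly what the managed side of such a hook looks like in Java via JNI. The library name and the saxpy entry point are invented for the example; a real binding would wrap whatever vendor library you care about:

public final class NativeGpuBridge {
    static { System.loadLibrary("gpubridge"); }   // hypothetical native library name

    // Implemented in C on the far side of the boundary (which in turn would call
    // CUDA, OpenCL, or whatever); the managed language only ever sees this signature.
    public static native void saxpy(float a, float[] x, float[] y, float[] out);
}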

Addressing the numeric bullets:
1) I think 'limited support for system libraries' is a huge understatement. More likely, if GPGPU ever becomes a mainstream paradigm, the entire call will be fully encapsulated into something the average programmer doesn't see. E.g., you may one day instantiate a 'multiply-accumulate-array-pair' object, which under the hood uses the GPU through an object.Go() interface (or, if no GPU is available, the CPU). GPUs would have to make enormous advances to ever run anything except compiled and annotated C directly on the GPU.
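A minimal Java sketch of what I'm picturing -- every name here is hypothetical; the only point is that the GPU never appears in the caller's view:

public final class MultiplyAccumulateArrayPair {
    // Stand-in for whatever GPU binding the runtime would ship; purely illustrative.
    interface GpuBackend {
        boolean isAvailable();
        double dotProduct(float[] a, float[] b);
    }

    private final float[] a, b;
    private final GpuBackend gpu;

    public MultiplyAccumulateArrayPair(float[] a, float[] b, GpuBackend gpu) {
        if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
        this.a = a; this.b = b; this.gpu = gpu;
    }

    // go() computes sum(a[i] * b[i]); where it runs is an implementation detail.
    public double go() {
        if (gpu != null && gpu.isAvailable()) {
            return gpu.dotProduct(a, b);   // offloaded path
        }
        double acc = 0.0;                  // plain CPU fallback
        for (int i = 0; i < a.length; i++) acc += (double) a[i] * b[i];
        return acc;
    }
}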

2) I think classes for parallelism are likely, but there just aren't that many compute-parallel jobs that matter. Video transcoding, games, and that's about it. And those things already do a great job utilizing the GPU as a GPU, with the kind of workload for which the GPU is designed.
The more likely parallelism abstractions will be those that improve perceived quality, i.e., not through parallel computation but through concurrent execution of some unrelated feature. An animated paperclip and a spinning hot dog and some background threads searching for "french fries" because your phone heard you say "phillip j fry", etc.

3) Asynchronous APIs are already there. But the same reason they never took off for commodity coders for files, disks, etc., will prevent them from taking off for GPGPU: Asynchrony is hard. Whatever ends up leveraging GPUs will present a synchronous, blocking API to Joe user.
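To illustrate the 'synchronous facade' point, here's a rough Java sketch. The 'device' below is just a stubbed-out background task, not a real GPU call; the shape of the API is what matters:

import java.util.concurrent.*;

public final class BlockingGpuFacade {
    private final ExecutorService offloadQueue = Executors.newSingleThreadExecutor();

    // The asynchrony lives here: work is queued and runs whenever the 'device' is free...
    private Future<float[]> submit(final float[] input) {
        return offloadQueue.submit(new Callable<float[]>() {
            public float[] call() {
                float[] out = new float[input.length];   // stand-in for a real kernel launch
                for (int i = 0; i < input.length; i++) out[i] = input[i] * 2f;
                return out;
            }
        });
    }

    // ...but Joe User only ever sees this: call it, get an answer back, no callbacks.
    public float[] process(float[] input) throws Exception {
        return submit(input).get();   // block until the 'device' finishes
    }
}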

Lastly, on-CPU GPU integration may lead to new opportunities in GPGPU, but the integration trend isn't driven by these potential opportunities. The primary motivator here is for both chipmakers to deliver 'good enough' graphics at low cost and low power envelopes, to motivate adoption of their products in new classes of devices. A secondary motivator is the obvious problem that nobody knows how to use CMPs (i.e., write parallel code), and that integrating more cores just isn't worth it -- SoC is a viable means to at least add features and continue selling chips in the short term.
 

Cogman

Lifer
Sep 19, 2000
10,283
134
106
Just one more thing to add to degibson's discussion. Not only are GPGPUs generally used for sciency applications, they are used for large sciency applications. In other words, you wouldn't use a GPGPU to multiply a 10-element array by 2. That would be extremely slow.

The issue is that transferring data from the CPU to the GPGPU is a slow process. It should get somewhat faster with the new APUs, but I wouldn't hold my breath for it ever becoming a lightning-fast operation.
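To put some very rough numbers on it (ballpark figures, not measurements of any particular card): a single round trip over PCIe costs somewhere in the neighbourhood of 10 microseconds before you've moved a single byte of payload, while multiplying 10 numbers on the CPU costs a few nanoseconds. The trip to the GPU is thousands of times more expensive than just doing that tiny job locally.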

x264, a project that does do a lot of math, has looked several times at using GPGPU programming; however, they could never come up with a large enough chunk of data that needed to be processed all in one go. The latency of talking to the GPGPU ended up killing every attempt to move data processing over to it.

Don't get me wrong, other encoders have done it. However, they all look like crap compared to x264. They end up doing a large amount of processing on the GPU, but have to take shortcuts to get it to work.
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
An animated paperclip and a spinning hot dog and some background threads searching for "french fries" because your phone heard you say "phillip j fry", etc.

Rofl. Best summary of mainstream consumer applications for parallel processing ever.

I'll add that parallelism, of which GPGPU programming is one example, is all the rage because we can no longer talk about the next generation of processor/platform as being exponentially faster than the last. So now it's all about cores. Parallelism has a big positive impact at the operating system layer: more cores for my Windows 7 64-bit to play with == better. But as degibson accurately and humorously highlights, we're way past the point where processing speed or throughput matters in consumer web or desktop applications. Mobile is where the speed gains will be impactful for the next few years.
 
Last edited:

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
Despite all the articles mentioning the clock-speed wall with parallelism as the escape hatch, 99% of code in consumer applications is still going to be single-threaded.

Applications are becoming more threaded to take advantage of multiple cores, but each thread is generally working on its own serial task. For example, Word might have one background thread doing grammar checking, not a parallel array of threads.

There's going to be some niche work like Photoshop, where you might apply some filter to every pixel in an image, but that will stay as native code for speed.
 

koshling

Member
Nov 15, 2005
43
0
0
Point taken on suitable workloads (at least the MARKET for suitable workloads, and therefore the return on effort expended).

For reference, the thing that triggered this was some work I am doing on shape recognition within a very constrained domain (the game of Go). Essentially the workload involves matching a library of patterns against a specific position (which is a grid of points, each of which may have one of a small number of possible values). Each search uses a very small dataset (a match instance needs access to only a few hundred bytes of data), but there are many searches, which is highly parallelisable. It struck me that this was probably a very GPGPU-friendly workload, which is what brought me to the accessibility (or otherwise) of the paradigm from my target environment.
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
In general, the characteristics of a good GPGPU application are:
- Large data set, needed to amortize the round-trip latency of the PCIe bus.
- High compute-to-memory ratio, needed to make the GPU noticeably better than the CPU
- Recurrence in the algorithm, so that a meaningful data set can be loaded into the GPU's memory and used many times, so that PCIe bandwidth isn't a rate limiter, and
- Low control complexity, such that the GPU can actually execute the workload

It sounds like your shape recognition problem lacks the large data set.
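If it helps, here's a back-of-envelope way to weigh the first two bullets against each other in Java. The constants are ballpark assumptions, not measurements of any particular part:

public final class OffloadEstimate {
    // Ballpark constants -- assumptions only; substitute numbers for your own hardware.
    static final double PCIE_LATENCY_S   = 10e-6;   // ~10 us per round trip
    static final double PCIE_BYTES_PER_S = 8e9;     // ~8 GB/s effective transfer rate
    static final double GPU_FLOPS        = 500e9;   // sustained, not peak
    static final double CPU_FLOPS        = 20e9;

    // True if the GPU round trip plausibly beats just doing the work on the CPU.
    static boolean worthOffloading(double bytesMoved, double flopsNeeded) {
        double gpuTime = PCIE_LATENCY_S + bytesMoved / PCIE_BYTES_PER_S + flopsNeeded / GPU_FLOPS;
        double cpuTime = flopsNeeded / CPU_FLOPS;
        return gpuTime < cpuTime;
    }

    public static void main(String[] args) {
        // Tiny job: a few hundred bytes, a few thousand ops -- latency swamps everything.
        System.out.println(worthOffloading(400, 4000));       // false
        // Big dense job: hundreds of MB, trillions of ops -- the transfer amortizes.
        System.out.println(worthOffloading(400e6, 2e12));     // true
    }
}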
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,363
4,068
75
I don't know; I'd like to hear more about this pattern matching. I looked up the game, and I see what you mean by patterns, I think. If you're loading each go pattern (into shared memory, possibly from texture or cache memory), and comparing it to each position on the board (stored in texture or cache memory), then doing it seven more times for rotated and mirrored versions, that might be a large enough data set.
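Generating those seven extra versions is cheap enough to do on the CPU before upload. A quick Java sketch of what I have in mind -- the element encoding is up to you:

public final class PatternSymmetries {
    // Rotate a grid of points 90 degrees clockwise.
    static int[][] rotate(int[][] p) {
        int rows = p.length, cols = p[0].length;
        int[][] r = new int[cols][rows];
        for (int y = 0; y < rows; y++)
            for (int x = 0; x < cols; x++)
                r[x][rows - 1 - y] = p[y][x];
        return r;
    }

    // Mirror a grid left-to-right.
    static int[][] mirror(int[][] p) {
        int rows = p.length, cols = p[0].length;
        int[][] m = new int[rows][cols];
        for (int y = 0; y < rows; y++)
            for (int x = 0; x < cols; x++)
                m[y][cols - 1 - x] = p[y][x];
        return m;
    }

    // The 8 variants: four rotations of the pattern plus four rotations of its mirror image.
    static int[][][] allVariants(int[][] p) {
        int[][][] out = new int[8][][];
        int[][] cur = p;
        for (int i = 0; i < 4; i++) { out[i] = cur; cur = rotate(cur); }
        cur = mirror(p);
        for (int i = 4; i < 8; i++) { out[i] = cur; cur = rotate(cur); }
        return out;
    }
}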

I also have to say that some kind of checksum done on board rectangles, followed by a lookup table, sounds really tempting. Lookup tables aren't good on a GPU, but if you could pull it off on a CPU it might be faster than what I described above. The problem may be that the patterns of stones and empty spaces may not be rectangular.
 

koshling

Member
Nov 15, 2005
43
0
0
I don't know; I'd like to hear more about this pattern matching. I looked up the game, and I see what you mean by patterns, I think. If you're loading each go pattern (into shared memory, possibly from texture or cache memory), and comparing it to each position on the board (stored in texture or cache memory), then doing it seven more times for rotated and mirrored versions, that might be a large enough data set.

I also have to say that some kind of checksum done on board rectangles, followed by a lookup table, sounds really tempting. Lookup tables aren't good on a GPU, but if you could pull it off on a CPU it might be faster than what I described above. The problem may be that the patterns of stones and empty spaces may not be rectangular.

Something like that, but lookup tables are not really sufficiently scalable, since patterns can vary considerably in size and can be quite large in some cases, and each point can have 3 states (actually 4 in some sense, since edges are significant, so it's beneficial to encode the lines 'just off' the play area as an edge state to also be matched against). Furthermore, they may not be rectangular (often they are not, since you want to exclude from matching intersections that don't affect the function of the pattern), and this rather clobbers the lookup-table approach.

What I actually do is represent a row of the board as a 64-bit quantity (actually only using 21 bits from the high and low words), encoding its intersection state as 2 bits per point. I then sweep the pattern rows (by logical rotation and masking, and actually applying a pattern mask as well to handle the non-rectangular issue) across the board state to find matches by row (and on a first-row match, check the second row, etc.). This is a very efficient process, but it's also compute/memory intensive. Although the dataset is not that large (the board state you are matching against is totally static, at least for each game tree node) and the bitmaps representing the shapes are individually small, there are a lot of them. Once you generate all rotations and reflections (and colour swaps) you quickly get to pattern sets in the hundreds or even thousands of patterns, which leads to quite a lot of computation.
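To make that a bit more concrete, here's a stripped-down Java sketch of the per-row test. The names are invented for illustration, it uses a contiguous 2-bits-per-point encoding rather than the split high/low-word planes the real code uses, and a plain shift stands in for the rotate-and-mask:

public final class RowMatcher {
    static final int BITS_PER_POINT = 2;   // empty / black / white / edge

    // True if patternRow (with its don't-care mask) matches boardRow at column offset 'col'.
    static boolean rowMatchesAt(long boardRow, long patternRow, long patternMask, int col) {
        long shiftedPattern = patternRow << (col * BITS_PER_POINT);
        long shiftedMask    = patternMask << (col * BITS_PER_POINT);
        return (boardRow & shiftedMask) == (shiftedPattern & shiftedMask);
    }

    // Sweep one pattern row across one board row, returning the first matching column (or -1).
    static int sweepRow(long boardRow, long patternRow, long patternMask,
                        int boardWidth, int patternWidth) {
        for (int col = 0; col + patternWidth <= boardWidth; col++) {
            if (rowMatchesAt(boardRow, patternRow, patternMask, col)) return col;
        }
        return -1;   // on a real match you'd go on to check the pattern's remaining rows
    }
}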

The problem with GPU-ising this (in isolation from the rest of the game analysis, at least) is going to be the latency in collecting results, which will bottleneck progress through the game tree as nodes are evaluated. I suspect, though, that you could go breadth-first and present the GPU with a batch of positions (1 ply of the game tree) to analyse at once, in which case you multiply the problem size by the branching factor (typically quite large in Go, even with fairly aggressive pruning).

PS - if anyone's interested, I started a blog on my adventures trying to write a Go program a few days ago. The first few entries are (and will be, since I only just started) a bit retrospective (I actually started on the program a few weeks ago), and the more interesting stuff hasn't really been blogged about yet, but I'm posting most days, so if anyone is interested: http://writingagoprogram.blogspot.com/
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
I'm sure you realized this, but a big advantage of a CPU-only approach is that your program can be ported to run on more than just Windows PCs that have a GPU, for example to iPhone / iPad (though run-time for higher AI skill levels might be painful).
 