GPU Supercomputing

njdevilsfan87

Platinum Member
Apr 19, 2007
2,331
251
126
I've entered the world of CUDA programming... What a pain in the ass, but now that I've got something working, it looks like GPU supercomputing is the real deal, and I'm strictly coding any simulations I run in CUDA now. As a quick test case I did a basic 3D heat transfer problem, conduction only: a 512x512x64 array, time-stepped 100 times (i.e., 100 loops).

3770K @ 4.6 GHz, 8 threads via OpenMP: 459 seconds
GTX Titan, stock clocks: 6 seconds (overclocked: 5.8 seconds)

I was like D: when I saw how fast my Titan did it. This is truly amazing. I just had to get that out.
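
For the curious, the core of an explicit conduction update like this is just a 7-point stencil kernel: each thread reads a cell and its six neighbors and writes the new value. A minimal sketch (boundary handling simplified; names and indexing are illustrative, not my exact code):

Code:
// One explicit time step of 3D heat conduction (finite differences).
// Each thread updates one interior cell; boundary cells are left untouched.
__global__ void conduction_step(const double* T, double* T_new,
                                int nx, int ny, int nz, double r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 || k < 1 || k >= nz - 1)
        return;

    int idx = (k * ny + j) * nx + i;   // flattened 3D index
    // r = alpha * dt / dx^2, assuming uniform grid spacing
    T_new[idx] = T[idx] + r * (T[idx - 1]       + T[idx + 1]
                             + T[idx - nx]      + T[idx + nx]
                             + T[idx - nx * ny] + T[idx + nx * ny]
                             - 6.0 * T[idx]);
}

Launch that once per time step, swapping the T and T_new pointers between steps.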
 

serpretetsky

Senior member
Jan 7, 2012
642
26
101
I'm still not up to your level of programming yet. Is there a reason you chose CUDA over OpenCL?
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
The tooling for checking performance and such is still better for CUDA than for OpenCL. But I would rather target OpenCL now, as the open standard will eventually take the entire market given enough time.
 

njdevilsfan87

Platinum Member
Apr 19, 2007
2,331
251
126
serpretetsky said:
I'm still not up to your level of programming yet. Is there a reason you chose CUDA over OpenCL?

I do plan to try out OpenCL as well; I just haven't gotten around to it. I'll spend the next few months coding some calculations in CUDA, and then I'll redo the same ones in OpenCL. It's just that, given I have a Titan, CUDA was the first thing that came to mind. It turns out Nvidia has a package called Nsight which makes getting CUDA set up in Visual Studio quick and painless. The pain only came afterward. But now it's starting to feel very, very worth it.
 

njdevilsfan87

Platinum Member
Apr 19, 2007
2,331
251
126
Here's another one I'm now working on: turbulent flow through a long, hollow square duct. It's sized 800x40x40 = 1.28M elements, with 2500 time steps. All calculations are of type "double" (float is actually 50% faster on the Titan for this calculation, and slower than double on the CPU).

2500K @ 3.9 GHz (1 thread): 2609 seconds
3770K @ 4.6 GHz (8 threads): 491 seconds
Titan @ 1006/6000: 10.96 seconds
Titan @ 1306/6000: 8.94 seconds
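
For anyone who wants to time something similar, the usual pattern is CUDA events around the whole stepping loop. A sketch (flow_step, d_u, and the launch config are placeholders, not my actual code):

Code:
// Host-side timing of the full time-stepping loop with CUDA events.
// Assumes d_u and d_u_new are device buffers that are already initialized.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int step = 0; step < 2500; ++step) {
    flow_step<<<grid, block>>>(d_u, d_u_new, nx, ny, nz);
    std::swap(d_u, d_u_new);        // ping-pong buffers between steps
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);         // wait for the GPU to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("elapsed: %.2f seconds\n", ms / 1000.0f);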

It feels like I'm coding on a desktop from 10 years in the future. Though right now the CPU and GPU outputs differ at the third decimal place. I'm not sure if that's normal, or if it's something on my end.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Based on your speedups, I'm going to guess you're doing explicit time-stepping and that you're using elements with linear geometry & linear solutions. So your updates are equivalent to sparse matrix-vector multiplies in the heat transfer case.

That's pretty much the best case for GPUs. Everything is local and data dependency isn't much more complex than doing like a dot product.
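
To make that concrete: each explicit update amounts to y = A*x with a fixed sparsity pattern, and the scalar one-thread-per-row CUDA version of that is about as simple as kernels get. A generic sketch (not from any particular code):

Code:
// y = A * x for a matrix stored in CSR format; one thread per row.
__global__ void spmv_csr(int n_rows, const int* row_ptr, const int* col_idx,
                         const double* vals, const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double sum = 0.0;
    for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
        sum += vals[jj] * x[col_idx[jj]];   // each row only touches its neighbors
    y[row] = sum;
}

No communication between rows, no synchronization: exactly the kind of work a GPU eats for breakfast.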

Maybe you're lucky enough to be working on problems where time accuracy matters and so explicit stepping is OK. Time-accurate turbulence simulations can be one such case. Heat transfer almost certainly isn't (CFL, yikes) unless you have some ridiculously stiff (OK, so I'm abusing this word) sources. Even then, not implicitly stepping elliptic operators is all but unheard of.
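
(For reference: with the standard explicit scheme on a uniform 3D grid, stability requires roughly dt <= dx^2 / (6*alpha), where alpha is the thermal diffusivity. Halve the grid spacing and the allowable time step drops by a factor of 4, which is why explicit stepping of diffusion gets painful fast.)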

Try adding in curved geometries, higher order solutions, fluxes, and... the big one: implicit time stepping. GPU performance still beats the CPU, but not by nearly as much. And the development track is way, way harder. 100x speedup in explicit stepping isn't such a big deal if implicit can take 1000x bigger steps.

Point being that yeah, you can get amazing boosts from the GPU. I'm doing some Monte Carlo integration where I can precompute a ton of stuff, so the MC loop is really just some BLAS2 ops. BAM, GPU owns it.

I've also worked on high-order solution & geometry, 3D RANS (SA) solvers with implicit time stepping & DGFEM (aerospace problems). I haven't implemented this on a GPU myself, but contemporary research shows speedups are more like 2-3x (the code is way more complex, low-level parallelism is harder to find, and everything is memory-bandwidth constrained), maybe 5x if you're amazing, and for way more developer time. Some tasks the GPU absolutely crushes. Others are still way easier to do on a CPU (especially once you need MPI, because even with like 16 GB of RAM you can't solve a meaningfully sized problem). Don't jump on GPGPU just because you've seen the GPU rock out on a few things it's amazing at.

Also, the 3rd decimal place is way, way too much error. Rough estimation: 1.3M elements, 6 DOFs per element, 2500 time steps, 8 neighbors per DOF, maybe 20 flops per neighbor per update? That's something like 3e12 (3 trillion) floating point ops, which is in the same ballpark as a 1 TFLOPS-peak device taking 10 seconds. With double precision eps ~1e-16 (both the CPU and the Titan are IEEE 754 compliant as far as I know), if you're only matching 3 digits that means...
1) your problem is really ill-conditioned. (unlikely, 1M elements is fairly few)
2) you got super unlucky: a perfect storm of worst-case floating point error (also unlikely, since the worst case almost never arises in practice)
edit: 2) is even more unlikely/impossible, because that worst case comes from doing 3e12 ops in a purely reducing operation (like one long sum). Effects in these kinds of fluid simulations are far less global, so 3e12 ops landing on a single accumulator is definitely not happening.

Unless I'm mistaken there, I'd go back and double check. Do you have unit tests? What output are you comparing?
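
Even something as simple as a max-relative-error check over the whole output field would pin this down. A sketch (names hypothetical):

Code:
// Compare the CPU and GPU result fields via max relative error.
#include <cmath>
#include <cstdio>

bool fields_match(const double* cpu, const double* gpu, int n, double rel_tol)
{
    double max_rel = 0.0;
    for (int i = 0; i < n; ++i) {
        double denom = std::fmax(std::fabs(cpu[i]), 1e-300); // guard divide-by-zero
        double rel   = std::fabs(cpu[i] - gpu[i]) / denom;
        if (rel > max_rel) max_rel = rel;
    }
    printf("max relative error: %g\n", max_rel);
    // For a healthy double-precision run, expect ~1e-12..1e-14 here, not 1e-3.
    return max_rel < rel_tol;
}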

And lastly, what compiler/options are you using? float should not be slower than double on a CPU. That's... bizarre. Are you sure everything is running in float and you aren't accidentally doing a bunch of mixed precision stuff?
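
The classic way that happens is unsuffixed literals: in C/C++, 0.5 is a double, so it silently promotes the surrounding expression to double math. Illustrative snippet only:

Code:
float a = 0.1f, b = 0.2f;
float slow = a * 0.5  + b;   // 0.5 is a double: a and b get promoted and the
                             // arithmetic runs in double, then converts back
float fast = a * 0.5f + b;   // 0.5f keeps the whole expression in float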
 