Based on your speedups, I'm going to guess you're doing explicit time-stepping and that you're using elements with linear geometry & linear solutions. So your updates are equivalent to sparse matrix-vector multiplies in the heat transfer case.
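If I've guessed your setup right, the "update is just an SpMV" claim looks like this — a minimal pure-Python sketch where the tiny 3-node stencil, the dt*alpha value, and the CSR arrays are all made up for illustration:

```python
# Sketch: an explicit heat-transfer step T_new = T + dt*alpha*(L @ T) is one
# sparse matrix-vector product with the assembled operator A = I + dt*alpha*L.
# A is stored in CSR form (data/indices/indptr), the layout a GPU kernel sees.

def spmv_csr(data, indices, indptr, x):
    """y = A @ x for a CSR matrix; each row is independent -> trivially parallel."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# Tiny 3-node 1D heat stencil, A = I + dt*alpha*L with dt*alpha = 0.1:
#   A = [[0.8, 0.1, 0.0],
#        [0.1, 0.8, 0.1],
#        [0.0, 0.1, 0.8]]
data    = [0.8, 0.1, 0.1, 0.8, 0.1, 0.1, 0.8]
indices = [0,   1,   0,   1,   2,   1,   2]
indptr  = [0, 2, 5, 7]

T = [1.0, 2.0, 3.0]
T_new = spmv_csr(data, indices, indptr, T)  # one SpMV = one explicit step
```

Every row of the product is independent, which is exactly why this maps so well onto thousands of GPU threads.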
That's pretty much the best case for GPUs: everything is local, and the data dependencies aren't much more complex than a dot product's.
Maybe you're lucky enough to be working on problems where time accuracy matters, so explicit stepping is OK. Time-accurate turbulence simulations can be one such case. Heat transfer almost certainly isn't (CFL, yikes) unless you have some ridiculously stiff (OK, so I'm abusing this word) sources. Even then, not implicitly stepping elliptic operators is all but unheard of.
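For concreteness, here's the back-of-envelope stability limit I'm gesturing at — the standard FTCS diffusion bound dt <= dx**2 / (2*d*alpha); the diffusivity and mesh spacings below are made-up numbers:

```python
# Why explicit heat transfer hurts: the explicit (FTCS) diffusion limit is
# dt <= dx**2 / (2 * d * alpha), so halving the mesh spacing quarters the
# allowable time step. Values here are illustrative, not from any real case.
alpha = 1e-4        # thermal diffusivity, m^2/s (hypothetical material)
d = 3               # spatial dimensions

for dx in (1e-2, 5e-3, 2.5e-3):
    dt_max = dx**2 / (2 * d * alpha)
    print(f"dx = {dx:.4g} m  ->  dt_max = {dt_max:.4g} s")
```

Refine the mesh and the explicit step collapses quadratically, which is exactly when you start wanting implicit stepping.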
Try adding in curved geometries, higher order solutions, fluxes, and... the big one: implicit time stepping. GPU performance still beats the CPU, but not by nearly as much. And the development track is way, way harder. 100x speedup in explicit stepping isn't such a big deal if implicit can take 1000x bigger steps.
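A minimal sketch of the implicit side, assuming plain backward Euler on a 1D diffusion stencil (sizes and the dt value are illustrative; a real solver would use a sparse or iterative solve, not a dense one):

```python
# Backward Euler, (I - dt*alpha*L) T_new = T_old, is unconditionally stable
# for diffusion, so dt is set by accuracy rather than stability. The catch:
# every step is now a linear solve, which is far harder to make fast on a
# GPU than a single SpMV. Dense solve here only to keep the sketch short.
import numpy as np

n = 50
L = np.zeros((n, n))                 # 1D Laplacian stencil (dx = 1)
for i in range(n):
    L[i, i] = -2.0
    if i > 0:
        L[i, i - 1] = 1.0
    if i < n - 1:
        L[i, i + 1] = 1.0

dt_alpha = 100.0                     # wildly beyond the explicit limit (~0.5 here)
A = np.eye(n) - dt_alpha * L

T = np.random.rand(n)
T_new = np.linalg.solve(A, T)        # one solve per implicit step
```

Even at 200x the explicit stability limit the solution stays bounded — that's the "1000x bigger steps" trade in miniature.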
Point being that, yeah, you can get amazing boosts from the GPU. I'm doing some Monte Carlo integration where I can precompute a ton of stuff outside the MC loop, so it's really just some BLAS-2 ops in there. BAM, GPU owns it.
I've also worked on high-order solution & geometry 3D RANS (SA) solvers with implicit time stepping & DGFEM (aerospace problems). I haven't implemented this on a GPU myself, but contemporary research shows speedups more like 2-3x (the code is way more complex, low-level parallelism is harder to find, and everything is memory-bandwidth constrained), maybe 5x if you're amazing, and for way more developer time. Some tasks the GPU absolutely crushes. Others are still way easier to do on a CPU (especially once you need MPI, because even with like 16 GB of RAM you can't solve a meaningfully sized problem on one card). Don't jump on GPGPU just because you've seen the GPU rock out on a few things it's amazing at.
Also, the 3rd decimal place is way, way too much error. Rough estimate: 1.3M elements, 6 DOFs per element, 2500 time steps, 8 neighbors per DOF, maybe 20 flops per neighbor per update? That's something like 3e12 (3 trillion, i.e. tera-scale) floating-point operations, which is in the same ballpark as a 1 TFLOP-peak device taking ~10 seconds. With double-precision eps ~1e-16 (both the CPU and the Titan are IEEE 754 compliant as far as I know), if you're only matching 3 digits that means...
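Spelling out that arithmetic (the counts are my guesses from above; the raw product is ~3.1e12 flops, so ~10 seconds corresponds to sustaining roughly a third of a 1 TFLOP peak — a plausible fraction):

```python
# Rough op count for the simulation described above (all figures estimated).
elements = 1.3e6
dofs_per_element = 6
steps = 2500
neighbors_per_dof = 8
flops_per_neighbor = 20

total = elements * dofs_per_element * steps * neighbors_per_dof * flops_per_neighbor
print(f"total ~ {total:.2e} flops")          # ~3.1e12

seconds_at_1tflop = total / 1e12             # ~3 s at peak, ~10 s at ~1/3 of peak
print(f"{seconds_at_1tflop:.1f} s at peak")
```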
1) your problem is really ill-conditioned. (unlikely, 1M elements is fairly few)
2) you got super unlucky--a perfect storm of worst-case floating-point error (also unlikely, since the 'worst case' almost never arises in practice)
edit: 2) is even more unlikely (impossible, really) because that worst case comes from doing 3e12 ops in one purely reducing operation (like a single long sum). Effects in these kinds of fluid simulations are far less global, so 3e12 operations landing on a single accumulator just isn't happening.
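To illustrate what that reducing worst case looks like, here's one long accumulator drifting off while a compensated sum (math.fsum) stays put — a toy, but it's the shape of error growth that only shows up in long reductions:

```python
# Naively summing 1e7 copies of 0.1 in double drifts by roughly 1e-4 because
# every addition rounds against one ever-growing accumulator. math.fsum
# (Shewchuk's compensated summation) recovers the correctly rounded result.
import math

n = 10**7
values = [0.1] * n
naive = sum(values)        # sequential adds into one accumulator
exact = math.fsum(values)  # compensated summation

print(naive)               # noticeably off from 1e6
print(exact)               # correct to machine precision
```

A PDE update touches each DOF's few neighbors per step rather than funneling everything through one accumulator, so this failure mode doesn't apply.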
Unless I'm mistaken there, I'd go back and double check. Do you have unit tests? What output are you comparing?
And lastly, what compiler/options are you using? float should not be slower than double on a CPU. That's... bizarre. Are you sure everything is running in float and you aren't accidentally doing a bunch of mixed precision stuff?
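One common way that happens, sketched in NumPy (illustrative, not your code): a single stray double-precision buffer silently promotes the whole expression to float64, so you pay conversion costs and still do all the arithmetic in double:

```python
# Accidental mixed precision: mixing a float32 array with a float64 array
# upcasts the result to float64, so the intended single-precision speedup
# evaporates and you add conversion overhead on top.
import numpy as np

a = np.ones(4, dtype=np.float32)   # the buffer you think you're computing in
b = np.ones(4, dtype=np.float64)   # e.g. one constant or array left in double

c = a * b
print(c.dtype)                     # float64 -- the float32 intent is gone
```

Worth grepping for unannotated literals and temporaries; one `double` in a hot loop can poison the whole chain.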