How do you know this for a fact, though? Do you have some inside info that CUDA functions properly on the 970? That may very well be the sole problem, but I don't know. I'm wondering how you know.
You don't need inside info to know how CUDA memory management operates, at least to the level pertinent to this discussion.
cudaMalloc (which is what the tool is using) allocates memory on the device. The tool actually just blindly keeps trying to allocate 128 MB chunks until the driver responds with an error. If you get 3800 MB through cudaMalloc, then there are 3800 MB of the card's VRAM in use by the tool. There are no two ways about it.
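To make the pattern concrete, here is a minimal sketch of that greedy-allocation loop. The 128 MB chunk size comes from the discussion above, but the loop structure is my reconstruction, not the tool's actual source:

```cuda
// Sketch of the greedy-allocation pattern: keep asking the driver for
// 128 MB blocks via cudaMalloc until it returns an error.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t chunkBytes = 128u * 1024u * 1024u;  // 128 MB per cudaMalloc
    std::vector<void*> chunks;
    void* p = nullptr;

    // Each successful cudaMalloc is real device memory handed to this app.
    while (cudaMalloc(&p, chunkBytes) == cudaSuccess)
        chunks.push_back(p);
    cudaGetLastError();  // clear the expected out-of-memory error

    printf("Allocated %zu MB of device memory in %zu chunks\n",
           (chunks.size() * chunkBytes) >> 20, chunks.size());

    for (void* c : chunks) cudaFree(c);
    return 0;
}
```

On a 4 GB card with nothing else resident, a loop like this is what gets you to roughly 3800 MB before the driver refuses the next chunk.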
Yes, WDDM's virtualization can mean a few of those chunks are shared by multiple apps (the CUDA tool, the DWM, Explorer, etc.) and data would need to be swapped back in before each of these applications accesses its context, but that's a completely different debate. It doesn't take away from the fact that the memory allocated by cudaMalloc is on the device, available to the application.
The only thing NVidia said was that applications needing less than 3.5GB will get all of the memory allocated from the 3.5GB region, and that if an application needs more RAM, it *will* get it.
Somehow this has been turned into a strong belief by a few on this forum that CUDA apps can only get 3.5GB of RAM, and that only games can access the whole 4GB. This is utter nonsense.
The tool gets access to 4GB on the device, and it's uncontested access on a card with no display attached. If it shows low bandwidths, then that's how the card is performing. Those low rates would actually be suspiciously high if the system were really swapping 128MB in from somewhere by magic, and even if that were true, the L2 rates would not be reduced by the same proportion (once the data has been swapped in, L2 access should not take the same proportional hit).
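For what it's worth, a per-chunk bandwidth number can be produced with nothing more exotic than a timed device-to-device copy. This is just a sketch of the general measurement idea, not the actual tool's methodology:

```cuda
// Hypothetical per-chunk bandwidth check: time a device-to-device copy
// within one 128 MB allocation using CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 128u * 1024u * 1024u;  // one 128 MB chunk
    char *src = nullptr, *dst = nullptr;
    if (cudaMalloc(&src, bytes) != cudaSuccess ||
        cudaMalloc(&dst, bytes) != cudaSuccess) return 1;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // The copy reads and writes each byte once: 2 * bytes moved in ms milliseconds.
    printf("Effective bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1.0e6));

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

A chunk that landed in the slow segment would show a markedly lower figure than one in the fast 3.5GB region, which is exactly the kind of drop the tool reports.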