It's a very good post.
Now the question arises: in Maxwell, when running parallel graphics+compute tasks, does the entire GPU need to flip between graphics <-> compute modes?
It would be interesting to see whether Pascal can do this at the TPC or GPC level, which would undoubtedly be better than the entire GPU having to change modes. I doubt they can do it at the SM level (similar to GCN, where it can be done at the CU level).
He said so right in that quote.
"However on Intel and Nvidia, the GPU is running either graphics or compute at a time (but can run multiple compute shaders or multiple graphics shaders concurrently). So in order to maximize the performance, you'd want submit large batches of either graphics or compute to the queue at once (not alternating between both rapidly). You get a GPU stall ("wait until idle") on each graphics<->compute switch (unless you are AMD of course)."
It's the context switch that hurts Kepler and Maxwell, even without async compute, just normal serial graphics -> compute rendering.
The more compute a game uses, the worse performance tanks due to this slow context switch.
NVIDIA has actually said the same thing since late 2014 in its official VR programming guide PDF for developers: an async compute timewarp can and does get stuck behind a graphics draw call; even on a priority preemption queue, it doesn't work.
Pascal resolves this flaw in the uarch: no slow context switch for graphics <-> compute workloads, and it supports fine-grained preemption, where a priority compute queue can SUSPEND an in-progress graphics queue and proceed immediately, as it should.
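For reference, this is roughly what requesting that high-priority compute queue looks like in D3D12 (a sketch; `device` and error handling are assumed, and the priority flag is only a scheduling hint — whether the GPU actually preempts an in-flight draw for it is exactly the Maxwell-vs-Pascal hardware difference described above).

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a separate compute-only queue at high priority, the pattern
// discussed for async timewarp. On Maxwell it can still sit behind a long
// graphics draw call; Pascal's fine-grained preemption lets it run promptly.
ComPtr<ID3D12CommandQueue> CreateTimewarpQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_QUEUE_TYPE_COMPUTE;    // compute-only queue
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;   // scheduling hint
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue)); // error check omitted
    return queue;
}
```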
This is why I say don't underestimate Pascal's uarch gains over Maxwell. The effect will be more pronounced in recent games with heavier compute usage. For example, in Quantum Break, which mixes a lot of compute and copy queues in with graphics all the time (per GPUView captures), Pascal will easily crush Maxwell.