From everything I've been able to discern, the Grid Management Unit (GMU) is a system that works in tandem with HyperQ to increase parallelism as much as possible by prioritizing workloads dynamically, even to the point where it can pause them or hold pending or suspended grids.
So having one GMU shouldn't matter, as all it's doing is dynamically managing which grids get sent for processing.
It's a single unit. A single block. A single block is not as dynamic or parallel as 8 separate units. This single unit is also limited in the number of queues it can prioritize when compared to the competition. Maxwell 2 is thus not nearly as parallel as GCN 1.1 (290 series) or GCN 1.2. There is no disputing this. It is a hardware fact.
If this were such a big deal, I think NVidia would have already added more GMUs for Maxwell. The entire process is very dynamic with Maxwell, so adding another GMU likely wouldn't make a big difference, considering that the GMU can pause active grids and hold pending or suspended ones.
It is only a big deal if you have games which make use of the amount of parallelism that Ashes of the Singularity uses. nVIDIA's response to the Ashes benchmark points to the fact that nVIDIA doesn't believe that Ashes of the Singularity is a good overall example of future DX12 titles. This is likely the same logic on which they based their decisions when building Maxwell 2.
Also, the Fury X has a big bandwidth advantage over the GTX 980 Ti, which becomes more pronounced at higher resolutions, particularly when MSAA is thrown into the mix. So it's not surprising that the Fury X would gain an edge at 4K when MSAA is being used.
The Fury-X only has a big bandwidth advantage on paper. In practice, however, nVIDIA's compression algorithms even up the score. See here:
http://techreport.com/review/28513/amd-radeon-r9-fury-x-graphics-card-reviewed/4
Ars Technica showed that a 290x can nearly match a GTX 980 Ti, and that memory bandwidth graph tells us that memory bandwidth is not what grants the Fury-X its lead.
Since a 290x can nearly match a GTX 980 Ti, and a GTX 980 Ti is a near match to a Fury-X, we can conclude that the 290x and the Fury-X are a near match under Ashes of the Singularity. This points to a common bottleneck between the Hawaii and Fiji architectures.
So we have to look at the nature of Ashes of the Singularity. Ashes of the Singularity does two things in a big way.
1. Makes ample use of Asynchronous Shading.
2. Draws MANY units onto the screen (requiring many Triangles or Polygons).
Since both the Fury-X and the 290x share the same Asynchronous Compute Engines, but with the Fury-X having more compute resources at its disposal, we can conclude that if Asynchronous Shading and compute resources were the bottleneck for Fiji and Hawaii, we'd see Fiji faring better than Hawaii. This is not the case.
Since both Fiji and Hawaii retain the same number of Hardware Rasterizers (and the same Peak Rasterization rate expressed in Gtris/s), we can conclude that both are bottlenecked by their Peak Rasterization rate (their ability to draw triangles/polygons).
Since the GTX 980 Ti has a much higher peak rasterization rate, we would expect the GTX 980 Ti to overpower the Fiji and Hawaii cards. This is not the case. Therefore we can conclude that the GTX 980 Ti is being limited by its Asynchronous Compute capabilities. We can test this hypothesis by looking at another benchmark, Star Swarm, which draws many triangles/polygons onto the screen but makes no use of Asynchronous Shading.
Just as expected.
Fiji and Hawaii are bottlenecked by their Peak Rasterization rates under Ashes of the Singularity while Maxwell 2 is bottlenecked by its ability to handle Asynchronous Shading.
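To make the bottleneck reasoning concrete, here is a rough toy model in Python. It assumes rasterization and async compute overlap completely, so the frame time is set by whichever stage takes longer. Every number in it is made up for illustration (these are not measured rates or benchmark results), and the names `Gpu`, `Workload` and `frame_time` are just hypothetical helpers for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    raster_rate: float  # triangle throughput, arbitrary units per ms (made up)
    async_rate: float   # async compute throughput, arbitrary units per ms (made up)


@dataclass(frozen=True)
class Workload:
    name: str
    triangles: float    # triangle work per frame, arbitrary units (made up)
    async_work: float   # async compute work per frame, arbitrary units (made up)


def frame_time(gpu: Gpu, load: Workload) -> tuple[float, str]:
    """Return (frame time in ms, limiting stage).

    Assumption: both stages run fully in parallel, so the slower one
    decides the frame time.
    """
    stages = {
        "raster": load.triangles / gpu.raster_rate,
        "async": load.async_work / gpu.async_rate,
    }
    limiter = max(stages, key=stages.get)
    return stages[limiter], limiter


# Illustrative GPUs: the two GCN-like parts share a raster rate but differ
# in compute; the Maxwell-2-like part has more raster but less async throughput.
HAWAII = Gpu("Hawaii-like", raster_rate=4.0, async_rate=8.0)
FIJI = Gpu("Fiji-like", raster_rate=4.0, async_rate=12.0)
MAXWELL2 = Gpu("Maxwell-2-like", raster_rate=6.0, async_rate=2.0)

# Illustrative workloads: both draw many triangles, only one uses async shading.
ASHES = Workload("Ashes-like", triangles=40.0, async_work=16.0)
SWARM = Workload("Star-Swarm-like", triangles=40.0, async_work=0.0)

for load in (ASHES, SWARM):
    for gpu in (HAWAII, FIJI, MAXWELL2):
        ms, limiter = frame_time(gpu, load)
        print(f"{load.name:16} {gpu.name:15} {ms:6.2f} ms  limited by {limiter}")
```

With these invented numbers, the two GCN-like parts land on the same frame time in the Ashes-like load because both hit the raster limit, while the Maxwell-2-like part is held back by async work. Take the async work away (the Star-Swarm-like load) and its higher raster rate shows through. It only illustrates the shape of the argument, it doesn't prove anything about the real hardware.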
That's my conclusion.